Posted to dev@spark.apache.org by Shouheng Yi <sh...@microsoft.com.INVALID> on 2017/02/22 20:51:19 UTC

[Spark Namespace]: Expanding Spark ML under Different Namespace?

Hi Spark developers,

Currently my team at Microsoft is extending Spark's machine learning functionality with new learners and transformers. We would like users to be able to use these within Spark pipelines, mixing and matching with existing Spark learners/transformers for an overall native Spark experience. We cannot accomplish this from a non-"org.apache" namespace with the current implementation, and we don't want to release code inside the Apache namespace, because that is confusing and there could be naming-rights issues.

We need to extend several classes from Spark that happen to be "private[spark]". For example, one of our classes extends VectorUDT [0], which is declared private[spark]. This unfortunately puts us in a strange scenario that forces us to work under the namespace org.apache.spark.

To be specific, the private classes/traits we currently need in order to create new Spark learners and transformers are HasInputCol, VectorUDT and Logging. We will expand this list as we develop more.

Is there a way to avoid this namespace issue? What do other people/companies do in this scenario? Thank you for your help!

[0]: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala
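
For concreteness, here is a minimal sketch of the kind of pipeline-compatible stage in question, built only on Spark ML's public API (assuming Spark 2.x; the package and class names are hypothetical):

package com.example.ml

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// A trivial third-party stage that lower-cases a string column. Because it
// extends the public UnaryTransformer, it can sit in a Pipeline next to
// built-in stages without touching the org.apache.spark namespace.
class Lowercaser(override val uid: String)
  extends UnaryTransformer[String, String, Lowercaser] {

  def this() = this(Identifiable.randomUID("lowercaser"))

  override protected def createTransformFunc: String => String = _.toLowerCase

  override protected def outputDataType: DataType = StringType

  override def copy(extra: ParamMap): Lowercaser = defaultCopy(extra)
}

A stage like this mixes into a Pipeline just fine; the namespace problem only starts once a stage needs private pieces such as HasInputCol, VectorUDT, or Logging.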

Best,
Shouheng


Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

Posted by Nick Pentreath <ni...@gmail.com>.
Currently your only option is to write (or copy) your own implementations.

Logging is definitely intended for internal use only, and it's best to
use your own logging lib; Typesafe's scala-logging is a common option that
I've used.
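
For illustration, a minimal sketch of such a standalone logging trait, assuming an slf4j backend is on the classpath (all names here are illustrative, not Spark's internal API):

package com.example.ml.util

import org.slf4j.{Logger, LoggerFactory}

// A self-contained stand-in for Spark's private[spark] Logging trait.
trait Logging {
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(getClass.getName.stripSuffix("$"))

  // By-name parameters avoid building the message when the level is off.
  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isWarnEnabled) log.warn(msg)

  protected def logError(msg: => String): Unit =
    if (log.isErrorEnabled) log.error(msg)
}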

As for the VectorUDT, for now that is private. There are no plans to open
it up as yet. It should not be too difficult to have your own UDT
implementation. What type of extensions are you trying to do with the UDT?
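
For reference, a rough sketch of what a self-contained UDT can look like; the caveat is that in Spark 2.x UserDefinedType is itself private[spark], so this file would still have to sit under an org.apache.spark package. The point type and all names are hypothetical:

package org.apache.spark.example

import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// A toy user type; MyPoint and MyPointUDT are hypothetical names.
case class MyPoint(x: Double, y: Double)

// Stores a MyPoint in Catalyst as a fixed-length array of doubles.
class MyPointUDT extends UserDefinedType[MyPoint] {
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  override def serialize(p: MyPoint): ArrayData =
    new GenericArrayData(Array[Any](p.x, p.y))

  override def deserialize(datum: Any): MyPoint = datum match {
    case a: ArrayData => MyPoint(a.getDouble(0), a.getDouble(1))
  }

  override def userClass: Class[MyPoint] = classOf[MyPoint]
}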

Likewise the shared params are for now private. It is a bit annoying to
have to re-create them, but most of them are pretty simple so it's not a
huge overhead.
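
As a concrete example, re-creating something like HasInputCol in your own namespace is only a few lines, since Param and Params are public (a sketch assuming Spark 2.x; the package name is illustrative):

package com.example.ml.param

import org.apache.spark.ml.param.{Param, Params}

// Mirrors the shape of Spark's private[ml] HasInputCol shared param.
trait HasInputCol extends Params {
  final val inputCol: Param[String] =
    new Param[String](this, "inputCol", "input column name")

  final def getInputCol: String = $(inputCol)
}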

Perhaps you can add your thoughts & comments to
https://issues.apache.org/jira/browse/SPARK-19498 in terms of extending
Spark ML. Ultimately I support making it easier to extend. But we do have
to balance that with exposing new public APIs and classes that impose
backward compat guarantees.

Perhaps now is a good time to think about some of the common shared params
for example.

Thanks
Nick



Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

Posted by Nick Pentreath <ni...@gmail.com>.
Also, note https://issues.apache.org/jira/browse/SPARK-7146 is linked from
SPARK-19498 specifically to discuss opening up sharedParams traits.



RE: [Spark Namespace]: Expanding Spark ML under Different Namespace?

Posted by Shouheng Yi <sh...@microsoft.com.INVALID>.
Hi Spark dev list,

Thank you all so much for your input; we really appreciate the suggestions. After some discussion within the team, we decided to stay under Apache's namespace for now and to attach comments explaining what we did and why.

As the Spark dev list kindly pointed out, this is a known issue, documented in the JIRA ticket [SPARK-19498] [0]. We will follow the ticket to see whether any newly suggested practices should be adopted in the future, and make corresponding fixes.

Best,
Shouheng

[0] https://issues.apache.org/jira/browse/SPARK-19498



Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

Posted by Tim Hunter <ti...@databricks.com>.
Regarding logging, Graphframes makes a simple wrapper this way:

https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/Logging.scala

Regarding the UDTs, they have been hidden so they can be reworked for Datasets; the reasons are detailed here [1]. Can you describe your use case in more detail? You may be better off copy/pasting the UDT code outside of Spark, depending on your use case.

[1] https://issues.apache.org/jira/browse/SPARK-14155


Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

Posted by Joseph Bradley <jo...@databricks.com>.
+1 for Nick's comment about discussing APIs which need to be made public in
https://issues.apache.org/jira/browse/SPARK-19498 !



-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.

http://databricks.com

Re: [Spark Namespace]: Expanding Spark ML under Different Namespace?

Posted by Steve Loughran <st...@hortonworks.com>.
On 22 Feb 2017, at 20:51, Shouheng Yi <sh...@microsoft.com.INVALID> wrote:

> Hi Spark developers,

> Currently my team at Microsoft is extending Spark's machine learning functionality with new learners and transformers. We would like users to be able to use these within Spark pipelines, mixing and matching with existing Spark learners/transformers for an overall native Spark experience. We cannot accomplish this from a non-"org.apache" namespace with the current implementation, and we don't want to release code inside the Apache namespace, because that is confusing and there could be naming-rights issues.

This isn't actually something the ASF has a strong stance against; it's more left to the projects themselves. After all, the source is licensed by the ASF, and the license doesn't say you can't.

Indeed, there's a bit of org.apache.hive in the Spark codebase where the Hive team kept stuff package-private, though that's really a sign that things could be improved there.

Where it becomes problematic is that stack traces end up blaming the wrong group; nobody likes getting a bug report for a bug that doesn't actually exist in their codebase, not least because you have to waste time just working that out.

You also have to expect absolutely no stability guarantees, so you'd better set your nightly build to work against trunk.

Apache Bahir does put some stuff into org.apache.spark.stream, but they've sort of inherited that right when they picked up the code from Spark. New stuff is going into org.apache.bahir.


> We need to extend several classes from Spark that happen to be "private[spark]". For example, one of our classes extends VectorUDT [0], which is declared private[spark]. This unfortunately puts us in a strange scenario that forces us to work under the namespace org.apache.spark.

> To be specific, the private classes/traits we currently need in order to create new Spark learners and transformers are HasInputCol, VectorUDT and Logging. We will expand this list as we develop more.

I do think it's a shame that logging went from public to private.

One thing that could be done there is to copy the logging into Bahir, under an org.apache.bahir package, for yourself and others to use. That'd be beneficial to me too.

For the ML stuff, that might be a place to work too, if you are going to open-source the code.



> Is there a way to avoid this namespace issue? What do other people/companies do in this scenario? Thank you for your help!

I've hit this problem in the past. Scala code tends to force your hand here precisely because of that (very nice) private feature. While it offers a project the ability to guarantee that implementation details aren't picked up where they weren't intended to be, in OSS dev all that implementation is visible, and for lower-level integration it is often exactly what you need to build against.

What I tend to do is keep my own code in its own package and build as thin a bridge as possible over to it from the [private] scope. It's also important to name things obviously, say org.apache.spark.microsoft, so stack traces in bug reports can be dealt with more easily.
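
To illustrate, a hedged sketch of such a bridge: one small file compiled into your own jar but declared under an obviously-foreign org.apache.spark package, re-exporting just the private[spark] pieces you need (all names here are illustrative, assuming Spark 2.x):

package org.apache.spark.ml.microsoft

import org.apache.spark.internal.Logging
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.sql.types.DataType

// The only file in the project that lives under org.apache.spark; everything
// else stays in your own namespace and talks to Spark through this bridge.
trait BridgedLogging extends Logging

object SparkBridge {
  // VectorUDT is private[spark]; constructing it here and widening to the
  // public DataType keeps callers outside the Spark namespace.
  def vectorDataType: DataType = new VectorUDT
}

Everything else in the project can then extend BridgedLogging or call SparkBridge.vectorDataType from its own namespace, so stack traces point at one clearly-labelled bridge file.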


> [0]: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala

> Best,
> Shouheng