Posted to dev@spark.apache.org by Reynold Xin <rx...@databricks.com> on 2015/04/30 00:21:47 UTC

[discuss] DataFrame function namespacing

Before we make DataFrame non-alpha, it would be great to decide how we want
to namespace all the functions. There are 3 alternatives:

1. Put everything in org.apache.spark.sql.functions. This is how SQL does
it, since SQL doesn't have namespaces. I estimate we will eventually have
~200 functions.

2. Have explicit namespaces, which is what master branch currently looks
like:

- org.apache.spark.sql.functions
- org.apache.spark.sql.mathfunctions
- ...

3. Have explicit namespaces, but restructure them slightly so everything is
under functions.

package object functions {

  // all the old functions here -- but deprecated so we keep source compatibility
  def ...
}

package org.apache.spark.sql.functions

object mathFunc {
  ...
}

object basicFuncs {
  ...
}
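
For a sense of what option 1 looks like from the user side, here is a
minimal sketch (the DataFrame df and its column names are assumed for
illustration; col, min, avg, and sqrt are existing members of the
functions object):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._  // one import covers every function

// "Basic" and "math" functions mix freely; the caller never needs to know
// which sub-namespace a function would otherwise have lived in.
def summarize(df: DataFrame): DataFrame =
  df.groupBy(col("name")).agg(min(col("value")), avg(sqrt(col("value"))))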

Re: [discuss] DataFrame function namespacing

Posted by Reynold Xin <rx...@databricks.com>.
After talking with people on this thread and offline, I've decided to go
with option 1, i.e. putting everything in a single "functions" object.


Re: [discuss] DataFrame function namespacing

Posted by Ted Yu <yu...@gmail.com>.
IMHO I would go with choice #1

Cheers

Re: [discuss] DataFrame function namespacing

Posted by Reynold Xin <rx...@databricks.com>.
We definitely still have the name collision problem in SQL.
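
To illustrate (a sketch only; sqlContext, df, and a registered table t with
a numeric "value" column are assumed): no matter how the Scala side is
packaged, SQL text resolves bare function names against one flat registry,
so a query can only ever see a single "min":

sqlContext.sql("SELECT name, min(value) FROM t GROUP BY name")
df.selectExpr("min(value)")  // expression strings hit the same flat namespace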

Re: [discuss] DataFrame function namespacing

Posted by Punyashloka Biswal <pu...@gmail.com>.
Do we still have to keep the names of the functions distinct to avoid
collisions in SQL? Or is there a plan to allow "importing" a namespace into
SQL somehow?

I ask because if we have to keep worrying about name collisions then I'm
not sure what the added complexity of #2 and #3 buys us.

Punya

Re: [discuss] DataFrame function namespacing

Posted by Reynold Xin <rx...@databricks.com>.
Scaladoc isn't much of a problem because scaladocs are grouped. Java/Python
is the main problem ...

See
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$
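
The grouping referred to above comes from scaladoc's @groupname/@group tags
(rendered when compiling with the -groups flag). A minimal self-contained
sketch of the idea -- not Spark's actual source:

final case class Column(expr: String)

/**
 * All functions in one flat object, but grouped in the generated docs.
 *
 * @groupname agg_funcs  Aggregate functions
 * @groupname math_funcs Math functions
 */
object functions {

  /** Aggregate function: minimum of the expression in a group. @group agg_funcs */
  def min(e: Column): Column = Column(s"MIN(${e.expr})")

  /** Math function: square root of the expression. @group math_funcs */
  def sqrt(e: Column): Column = Column(s"SQRT(${e.expr})")
}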

Re: [discuss] DataFrame function namespacing

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
My feeling is that we should have a handful of namespaces (say 4 or 5).
Many more than that becomes too cumbersome to import and remember, while
having everything in one package makes the scaladoc etc. hard to read.
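
Concretely, the trade-off in imports looks something like this (mathFunc
and basicFuncs are the hypothetical objects from option 3, so the second
half is a sketch of future user code, not something that compiles today):

// Option 1: a single import, but one very large doc page.
import org.apache.spark.sql.functions._

// Option 3: small, topical doc pages, but users must remember which
// object owns each function before they can import or call it.
import org.apache.spark.sql.functions.basicFuncs._
import org.apache.spark.sql.functions.mathFunc._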

Thanks
Shivaram

Re: [discuss] DataFrame function namespacing

Posted by Reynold Xin <rx...@databricks.com>.
To add a little bit more context, some pros/cons I can think of are:

Option 1: Very easy for users to find the function, since they are all in
org.apache.spark.sql.functions. However, there will be quite a large number
of them.

Option 2: I can't tell why we would want this one over Option 3, since it
has all the problems of Option 3 and not as nice a hierarchy.

Option 3: Opposite of Option 1. Each "package" or static class has a small
number of functions that are relevant to each other, but for some functions
it is unclear where they should go (e.g. should "min" go into basic or
math?)
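
The "where does min go" question also has a language-level edge: if two of
the option 3 objects ever export the same name, wildcard imports stop
working. A self-contained toy (hypothetical objects and a toy Column type,
purely to show the Scala behavior):

final case class Column(expr: String)

object basicFuncs {
  def min(e: Column): Column = Column(s"MIN(${e.expr})")
}

object mathFunc {
  // If "min" were (also) classified as a math function...
  def min(e: Column): Column = Column(s"MIN(${e.expr})")
}

object Demo extends App {
  import basicFuncs._
  import mathFunc._
  // min(Column("value"))  // does not compile: reference to min is ambiguous
  println(basicFuncs.min(Column("value")))  // callers must qualify instead
}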



