Posted to dev@spark.apache.org by Xiangrui Meng <me...@gmail.com> on 2015/05/08 08:59:37 UTC

Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

Hi all,

In PySpark, a DataFrame column can be referenced using df["abcd"]
(__getitem__) or df.abcd (__getattr__). There is a discussion on
SPARK-7035 about compatibility issues with the __getattr__ approach, and
I would like to collect more input on this.

Basically, if we introduce a new method to DataFrame in the future, it
may break user code that uses the same attribute name to reference a
column, or silently change that code's behavior. For example, if we add
name() to DataFrame in the next release, all existing code using
`df.name` to reference a column called "name" will break, because
`df.name` would then return the bound method instead of the column. If
we add `name` as a property instead of a method, existing code using
`df.name` may still work but with a different meaning:
`df.select(df.name)` no longer selects the column called "name" but the
column whose name is the value of `df.name`.
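
To make the failure mode concrete, here is a minimal sketch in plain
Python (no Spark needed). ToyFrame and its two subclasses are
hypothetical stand-ins, not PySpark classes, and a "column" is just a
string. The sketch relies only on the fact that Python calls
__getattr__ after normal attribute lookup has failed, so any real
method or property wins over a column of the same name.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; a 'column' is just a string."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getitem__(self, name):
        if name not in self._columns:
            raise KeyError(name)
        return "Column<%s>" % name

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name)


class FrameWithMethod(ToyFrame):
    """Simulates a later release that adds a name() method."""

    def name(self):
        return "my_frame"


class FrameWithProperty(ToyFrame):
    """Simulates a later release that adds name as a property."""

    @property
    def name(self):
        return "age"


columns = ["name", "age"]
before = ToyFrame(columns).name                # the column called "name"
as_method = FrameWithMethod(columns).name      # a bound method, not a column
as_property = FrameWithProperty(columns).name  # the string "age"

# With the property in place, indexing by df.name picks a different column.
frame = FrameWithProperty(columns)
selected = frame[frame.name]

# Item access keeps referring to the column in every variant.
still_safe = FrameWithMethod(columns)["name"]
```

In this toy model only the attribute form changes meaning between
"releases"; the df["name"] form is unaffected.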

There are several proposed solutions:

1. Keep both df.abcd and df["abcd"], and encourage users to use the
latter, which is future-proof. This is the current solution in master
(https://github.com/apache/spark/pull/5971). But I think users may
still be unaware of the compatibility issue and prefer `df.abcd` to
`df["abcd"]` because the former can be auto-completed.
2. Drop df.abcd and support df["abcd"] only. The cost is usability, as
Wes' comment on the JIRA page suggests: "I actually dragged my feet on
the __getattr__ issue for several months back in the day, then finally
added it (and tab completion in IPython with __dir__), and immediately
noticed a huge quality-of-life improvement when using pandas for actual
(esp. interactive) work."
3. Replace df.abcd with df.abcd_ (with a suffix "_"). Both df.abcd_ and
df["abcd"] would be future-proof, and df.abcd_ could be
auto-completed. The tradeoff is the extra "_" appearing in
the code.
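
As an illustration of option 3, here is a rough sketch of how suffixed
attribute access and tab completion could fit together. SuffixFrame is
a hypothetical toy class, not the PySpark implementation, and the
sketch assumes that no DataFrame method name will ever end in "_".

```python
class SuffixFrame:
    """Hypothetical toy frame where df.abcd_ means the column "abcd"."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getitem__(self, name):
        if name not in self._columns:
            raise KeyError(name)
        return "Column<%s>" % name

    def __getattr__(self, name):
        # Only public names ending in "_" are treated as column references.
        if name.endswith("_") and not name.startswith("_"):
            try:
                return self[name[:-1]]
            except KeyError:
                pass
        raise AttributeError(name)

    def __dir__(self):
        # Advertise the suffixed column names so that tab completion sees them.
        suffixed = {col + "_" for col in self._columns}
        return sorted(set(super().__dir__()) | suffixed)

    def name(self):
        """A method that would collide with an unsuffixed column 'name'."""
        return "my_frame"


df = SuffixFrame(["name", "age"])
column = df.name_      # the column called "name"
method_result = df.name()  # the method, unaffected by the column
```

Under that assumption, df.name_ and df.name never compete for the same
attribute, at the price of the trailing underscore.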

My preference is 3 > 1 > 2. Your input would be greatly appreciated. Thanks!

Best,
Xiangrui

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@spark.apache.org
For additional commands, e-mail: dev-help@spark.apache.org


Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

Posted by Punyashloka Biswal <pu...@gmail.com>.
Is there a foolproof way to access methods exclusively (instead of picking
between columns and methods at runtime)? Here are two ideas, neither of
which seems particularly Pythonic:

   - pyspark.sql.methods(df).name()
   - df.__methods__.name()
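
As a rough sketch of the first idea: a small proxy can resolve names
against the object's class only, so columns are never considered.
Neither methods() nor ToyFrame exists in PySpark; both are hypothetical
names used for illustration.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame with attribute-style columns."""

    def __init__(self, columns):
        self._columns = list(columns)

    def __getattr__(self, name):
        if not name.startswith("_") and name in self._columns:
            return "Column<%s>" % name
        raise AttributeError(name)

    def count(self):
        return 42


class _MethodsProxy:
    """Resolves names against the wrapped object's class only."""

    def __init__(self, obj):
        self._obj = obj

    def __getattr__(self, name):
        obj = object.__getattribute__(self, "_obj")
        cls = type(obj)
        # Columns are served by the instance-level __getattr__, so a lookup
        # on the class can only find real methods and other class attributes.
        if not callable(getattr(cls, name, None)):
            raise AttributeError("%s has no method %r" % (cls.__name__, name))
        return getattr(obj, name)


def methods(obj):
    """Return a view of obj that exposes its methods but not its columns."""
    return _MethodsProxy(obj)


df = ToyFrame(["count", "age"])
row_count = methods(df).count()
```

Here methods(df).age raises AttributeError even though df.age is a
valid column reference.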

Punya


Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

Posted by Nicholas Chammas <ni...@gmail.com>.
And a link to SPARK-7035
<https://issues.apache.org/jira/browse/SPARK-7035> (which
Xiangrui mentioned in his initial email) for the lazy.


Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

Posted by Xiangrui Meng <me...@gmail.com>.
On Fri, May 8, 2015 at 12:18 AM, Shivaram Venkataraman
<sh...@eecs.berkeley.edu> wrote:
> I don't know much about Python style, but I think the point Wes made about
> usability on the JIRA is pretty powerful. IMHO the number of methods on a
> Spark DataFrame might not be much larger than in Pandas. Given that
> users seem to be okay with the possibility of collisions in Pandas, I
> think sticking with (1) is not a bad idea.
>

That is true for interactive work. But Spark's DataFrames can handle
really large datasets and might be used in production workflows, so
I think it is reasonable for us to care more about compatibility
issues than Pandas does.

> Also, is it possible to detect such collisions in Python? A fourth option (4)
> might be to detect that `df` contains a column named `name` and print a
> warning in `df.name` that tells the user that the method is overriding the
> column.

Maybe we can inspect the frame in which `df.name` gets called and warn
users in `df.select(df.name)` but not in `name = df.name`. This could be
tricky to implement.

-Xiangrui



Re: Collect inputs on SPARK-7035: compatibility issue with DataFrame.__getattr__

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
I don't know much about Python style, but I think the point Wes made about
usability on the JIRA is pretty powerful. IMHO the number of methods on a
Spark DataFrame might not be much larger than in Pandas. Given that
users seem to be okay with the possibility of collisions in Pandas, I
think sticking with (1) is not a bad idea.

Also, is it possible to detect such collisions in Python? A fourth option (4)
might be to detect that `df` contains a column named `name` and print a
warning in `df.name` that tells the user that the method is overriding the
column.
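
A simplified sketch of such a check, under the assumption that warning
once at construction time is acceptable (rather than on each `df.name`
access). WarningFrame is a hypothetical toy class, not PySpark code; it
uses the standard warnings module and compares column names against the
attributes defined on the class.

```python
import warnings


class WarningFrame:
    """Hypothetical toy frame that warns when a column is hidden by a method."""

    def __init__(self, columns):
        self._columns = list(columns)
        for col in self._columns:
            # hasattr on the class sees methods and properties, never columns.
            if hasattr(type(self), col):
                warnings.warn(
                    "column %r is hidden by DataFrame.%s; use df[%r] instead"
                    % (col, col, col),
                    stacklevel=2,  # attribute the warning to the caller
                )

    def __getitem__(self, name):
        if name not in self._columns:
            raise KeyError(name)
        return "Column<%s>" % name

    def __getattr__(self, name):
        if not name.startswith("_") and name in self._columns:
            return self[name]
        raise AttributeError(name)

    def name(self):
        return "my_frame"


# Capture the warnings so that they can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df = WarningFrame(["name", "age"])

messages = [str(w.message) for w in caught]
```

In this sketch only the "name" column triggers a warning; "age" does
not collide with anything defined on the class.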

Thanks
Shivaram

