Posted to issues@spark.apache.org by "Reynold Xin (JIRA)" <ji...@apache.org> on 2015/04/21 20:03:58 UTC

[jira] [Comment Edited] (SPARK-7035) Drop __getattr__ on pyspark.sql.DataFrame

    [ https://issues.apache.org/jira/browse/SPARK-7035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14505362#comment-14505362 ] 

Reynold Xin edited comment on SPARK-7035 at 4/21/15 6:03 PM:
-------------------------------------------------------------

It'd be great to understand the delta between this and Pandas, and do something accordingly.

Is the problem that Pandas simply patches the current object, while we are using __getattr__? If so, maybe the Pandas way is better and we should switch to it. However, my Python knowledge is limited, and it would be great for others to chime in.


was (Author: rxin):
It'd be great to understand the delta between this and Pandas, and do something accordingly.


> Drop __getattr__ on pyspark.sql.DataFrame
> -----------------------------------------
>
>                 Key: SPARK-7035
>                 URL: https://issues.apache.org/jira/browse/SPARK-7035
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 1.4.0
>            Reporter: Kalle Jepsen
>
> I think the {{\_\_getattr\_\_}} method on the DataFrame should be removed.
> There is no point in being able to address a DataFrame's columns as {{df.column}}, other than the questionable goal of pleasing R developers. And it seems R people will be able to use Spark from their native API in the future.
> I see the following problems with {{\_\_getattr\_\_}} for column selection:
> * It's un-pythonic: There should only be one obvious way to solve a problem, and we can already address columns on a DataFrame via the {{\_\_getitem\_\_}} method, which in my opinion is by far superior and a lot more intuitive.
> * It leads to confusing exceptions. When we mistype a method name, the {{AttributeError}} will say 'No such column ... '.
> * And most importantly: we cannot load DataFrames that have columns with the same name as any attribute on the DataFrame object. Imagine having a DataFrame with a column named {{cache}} or {{filter}}. Calling {{df.cache()}} will be ambiguous and lead to broken code.
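
The last two points can be reproduced in plain Python without Spark. The following is only a minimal sketch: ToyFrame is a hypothetical stand-in class, not PySpark's actual DataFrame implementation, and its method and error message are made up for illustration. It relies on one documented Python rule: __getattr__ is called only when normal attribute lookup has already failed.

```python
class ToyFrame:
    """Hypothetical stand-in for a DataFrame; NOT PySpark's implementation."""

    def __init__(self, columns):
        self._columns = dict(columns)

    def cache(self):
        # A regular method, found by normal attribute lookup.
        return "cache() method called"

    def __getitem__(self, name):
        # Explicit column access: df["name"]
        try:
            return self._columns[name]
        except KeyError:
            raise KeyError("No such column: %s" % name)

    def __getattr__(self, name):
        # Python calls this only after normal attribute lookup fails.
        # Private names are rejected to avoid recursion on self._columns.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._columns[name]
        except KeyError:
            raise AttributeError("No such column: %s" % name)


df = ToyFrame({"age": [1, 2], "cache": [3, 4]})

age_by_attr = df.age         # no such attribute, so __getattr__ returns the column
cache_by_attr = df.cache     # the bound method wins; the column is shadowed
cache_by_item = df["cache"]  # __getitem__ always reaches the column

try:
    df.chache()              # mistyped method name
except AttributeError as exc:
    typo_message = str(exc)  # reports a missing column, not a missing method
```

In this sketch df.cache still resolves to the method, so the column named cache is reachable only through df["cache"], and the typo surfaces as a "No such column" error. Whether the real DataFrame behaves the same way depends on how its __getattr__ is written.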



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org
