Posted to user@spark.apache.org by SamPenrose <sp...@mozilla.com> on 2016/11/12 00:36:25 UTC

pyspark: accept unicode column names in DataFrame.corr and cov

The corr() and cov() methods of DataFrame require an instance of str for
column names:

https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1356

although instances of basestring appear to work for addressing columns:

https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L708
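
For a concrete illustration of the asymmetry (a sketch against a Python 2.7
PySpark session; the DataFrame and the column names here are made up):

    # Python 2.7, assuming a SparkSession is available as `spark`
    df = spark.createDataFrame([(1.0, 2.0), (3.0, 4.0)], ['a', 'b'])

    df[u'a']             # addressing a column by a unicode name works fine
    df.corr(u'a', u'b')  # fails the isinstance(col1, str) check and raises
                         # ValueError, even though u'a' names a real column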

Humble request: could we replace the "isinstance(col1, str)" tests with
"isinstance(col1, basestring)"?

Less humble request: why test types at all? Why not just do one of {raise
KeyError, coerce to string}?

Cheers,
Sam





Re: pyspark: accept unicode column names in DataFrame.corr and cov

Posted by Hyukjin Kwon <gu...@gmail.com>.
Hi Sam,

I think I have answers for both of your questions.

> Humble request: could we replace the "isinstance(col1, str)" tests with
"isinstance(col1, basestring)"?

IMHO, yes, I believe this should be basestring. Otherwise, some functions
would not accept unicode as arguments for columns in Python 2.7.
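
One wrinkle worth noting: basestring does not exist in Python 3, so the check
would need a compatibility alias. If I remember correctly, dataframe.py
already defines one near the top of the module, roughly like this (a sketch,
not the exact code):

    import sys

    if sys.version >= '3':
        basestring = str   # on Python 3, unicode and str are the same type

    # with the alias in place, a single check covers Python 2.7 and 3.x
    isinstance(u'a', basestring)   # True on both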

> Less humble request: why test types at all? Why not just do one of {raise
KeyError, coerce to string}?

I believe argument type checking is pretty common in other Python libraries
too, such as NumPy.
ValueError seems more appropriate than KeyError here, because the problem is
that the argument has the wrong type, not that a key is missing.
Also, I think silently coercing the argument to a string might confuse users.

If the current behaviour is problematic or inconsistent, I guess it could be
changed, but I think it is okay as it is.
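
To illustrate why coercion could confuse users, here is a plain-Python sketch
(not pyspark internals; both helpers below are made up):

    cols = {'a': [1.0, 3.0], 'b': [2.0, 4.0]}

    def lookup(columns, name):
        # explicit check: fail fast with a clear message
        if not isinstance(name, basestring):
            raise ValueError("column name should be a string, got %r" % type(name))
        return columns[name]

    def lookup_coercing(columns, name):
        # silent coercion: a wrong argument turns into a bogus column name
        return columns[str(name)]

    lookup(cols, u'a')           # unicode name accepted, returns [1.0, 3.0]
    # lookup_coercing(cols, 42)  # would fail later with KeyError: '42', which
                                 # is harder to trace than a ValueError raised
                                 # right where the bad argument was passed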

Thanks.
