Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/02 17:51:56 UTC
[GitHub] asmello edited a comment on issue #23882: [SPARK-26979][PySpark][WIP] Add missing column name support for some SQL functions
URL: https://github.com/apache/spark/pull/23882#issuecomment-468942194
OK, this was tedious. I've whitelisted all functions defined in `functions.py`, except for the previously identified exceptions, all of which were being defined automatically.
All these functions take columns as arguments, and they **explicitly** handle Column/name duality:
```
approx_count_distinct
coalesce
corr
covar_pop
covar_samp
countDistinct
first
grouping
grouping_id
isnan
isnull
last
nanvl
round
bround
shiftLeft
shiftRight
shiftRightUnsigned
struct
greatest
least
when
log
log2
conv
factorial
lag
col
year
quarter
month
dayofweek
dayofmonth
dayofyear
hour
minute
second
weekofyear
date_add
date_sub
datediff
add_months
months_between
to_date
to_timestamp
trunc
date_trunc
next_day
last_day
from_unixtime
unix_timestamp
from_utc_timestamp
to_utc_timestamp
window
crc32
md5
sha1
sha2
hash
concat_ws
decode
encode
format_number
format_string
instr
substring
substring_index
levenshtein
locate
lpad
rpad
repeat
split
regexp_extract
regexp_replace
soundex
bin
hex
unhex
length
translate
create_map
map_from_arrays
array
array_contains
arrays_overlap
slice
concat
array_position
element_at
array_remove
array_distinct
array_intersect
array_union
array_except
explode
posexplode
explode_outer
posexplode_outer
get_json_object
json_tuple
from_json
to_json
schema_of_json
schema_of_csv
size
array_min
array_max
sort_array
array_sort
shuffle
reverse
flatten
map_keys
map_values
map_entries
map_from_entries
array_repeat
arrays_zip
map_concat
sequence
from_csv
```
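To make the duality concrete, here's a quick illustration (just usage, not part of the patch), picking `isnan` arbitrarily from the list above:
```
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, float("nan")), (2, 3.0)], ["id", "val"])

# Both spellings are accepted; the wrapper handles the name-or-Column conversion:
df.select(F.isnan("val"), F.isnan(F.col("val"))).show()
```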
@HyukjinKwon you mentioned `from_json()`, but that's not a true exception. The schema argument is not meant to be an actual column, it's a schema specification, so it's a completely different case. The same function also takes an actual column to operate on, which it explicitly accepts as a name or a Column instance. So there's no problem there.
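To illustrate (again a sketch, reusing the `spark` session from the snippet above): the schema goes in as a spec (a DDL string or a StructType), while the data column can be a name or a Column:
```
from pyspark.sql import functions as F

df = spark.createDataFrame([('{"x": 1}',)], ["payload"])

# The schema is a specification; the column argument accepts either form:
df.select(F.from_json("payload", "x INT")).show()
df.select(F.from_json(F.col("payload"), "x INT")).show()
```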
As for the DataFrame methods in `dataframe.py`, the following also take column names or Column instances and do the conversion explicitly:
```
sampleBy
sort
select
groupBy
rollup
cube
dropna
fillna
replace
drop
toDF
```
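Quick usage illustration of the above (same caveat: not part of the patch, and reusing the `spark` session from earlier):
```
df = spark.createDataFrame([(3, "c"), (1, "a")], ["id", "v"])

# Either spelling works for these methods:
df.select("id", df.v)
df.sort("id")
df.sort(df.id.desc())
df.groupBy("v").count()
```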
I was even a little surprised by `drop()`; I thought it would only take names, but it handles both cases. **EDIT: Sorry, I double-checked and it does support both, but only when you pass a single argument. If you drop more than one column, it enforces that every argument is a string, because there is no `Seq[Column]` overload on the Scala side.**
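Concretely, the behaviour I'm describing looks like this:
```
df = spark.createDataFrame([(1, 2, 3)], ["a", "b", "c"])

df.drop("a")           # single argument: a name works
df.drop(df.a)          # single argument: a Column works too
df.drop("a", "b")      # multiple arguments: names only
# df.drop(df.a, df.b)  # raises TypeError – each col must be a string here
```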
The following methods are also fully compliant with both API styles, but they rely on the JVM backend to do the conversion:
```
join
sortWithinPartitions
describe
__getitem__
__getattr__
dropDuplicates
```
This is something that should be kept in mind should the Scala API change.
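For example, `join` accepts either form today only because the Scala side resolves it:
```
left = spark.createDataFrame([(1, "x")], ["id", "v1"])
right = spark.createDataFrame([(1, "y")], ["id", "v2"])

left.join(right, "id")                 # join key passed as a name
left.join(right, left.id == right.id)  # join condition passed as a Column expression
```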
The following functions are violations; they only take **strings**:
```
approxQuantile
corr
cov
crosstab
freqItems
```
The reason these are exceptions is that they're implemented in `org.apache.spark.sql.DataFrameStatFunctions`, and for some reason almost all functions there only take name arguments. The PySpark side never converts a Column object back to its name, so it also only supports the name spec. This is something that should be fixed on the Scala side, IMO. Since this patch is about making _name_ support universal, however, I don't think it's in scope to do anything about that.
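For example (a sketch of the current behaviour as I read it, not part of the patch):
```
df = spark.createDataFrame([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)], ["a", "b"])

df.approxQuantile("a", [0.5], 0.01)     # a name works
# df.approxQuantile(df.a, [0.5], 0.01)  # a Column is rejected with a TypeError
df.corr("a", "b")                       # same restriction for corr/cov/crosstab/freqItems
```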
One other special case I should mention is `colRegex`, which also only takes strings. But in that case it just wouldn't make sense to take Column objects (how would you specify a regex that way?), so that is fine.
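For completeness, the string-only `colRegex` usage looks something like this (illustrative only; the regex is a backtick-quoted string):
```
df = spark.createDataFrame([("a", 1), ("b", 2)], ["Col1", "Col2"])
df.select(df.colRegex("`(Col1)?+.+`")).show()   # the regex is necessarily a string
```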
With these checks in place, I'm confident the list of exceptions in this PR's description **is** exhaustive. I will work on solving those now, as agreed (by redefining the `_create_function` argument-passing strategy to do explicit conversion).
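Roughly, what I have in mind is something along these lines (a sketch only, assuming the existing `_to_java_column` helper, which already accepts either a Column or a name; the actual patch may differ):
```
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def _create_function(name, doc=""):
    """Create a wrapper for a JVM SQL function that accepts a Column or a column name."""
    def _(col):
        sc = SparkContext._active_spark_context
        # _to_java_column handles both a Column instance and a string name
        jc = getattr(sc._jvm.functions, name)(_to_java_column(col))
        return Column(jc)
    _.__name__ = name
    _.__doc__ = doc
    return _
```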