Posted to dev@spark.apache.org by André Mello <as...@gmail.com> on 2019/02/24 14:53:52 UTC

[DISCUSS][SQL][PySpark] Column name support for SQL functions

# Context

This comes from [SPARK-26979], which became PR #23879 and then PR
#23882. The following reflects all the findings made so far.

# Description

Currently, in the Scala API, some SQL functions have two overloads,
one taking a string that names the column to be operated on, the other
taking a proper Column object. This allows for two patterns of calling
these functions, which is a source of inconsistency and confuses new
users, since it is hard to predict which functions accept a column name
and which do not.
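
For concreteness, the same two calling patterns as they surface in
PySpark (a rough sketch; the DataFrame and column names are just
placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ["foo"])

    df.select(F.max("foo"))          # pass the column name as a string
    df.select(F.max(F.col("foo")))   # pass an explicit Column object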

The PySpark API partially solves this problem by internally converting
the argument to a Column object prior to passing it through to the
underlying JVM implementation. This allows for a consistent use of
name literals across the API, except for a few violations:

- lower()
- upper()
- abs()
- bitwiseNOT()
- ltrim()
- rtrim()
- trim()
- ascii()
- base64()
- unbase64()

These violations happen because for a subset of the SQL functions,
PySpark uses a functional mechanism (`_create_function`) to directly
call the underlying JVM equivalent by name, thus skipping the
conversion step. In most cases the column name pattern still works
because the Scala API has its own support for string arguments, but
the aforementioned functions are also exceptions there.
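
For reference, a simplified sketch of that mechanism - not the exact
source, but close to how I read it:

    from pyspark import SparkContext
    from pyspark.sql.column import Column

    def _create_function(name, doc=""):
        # Build a Python wrapper that calls the JVM function `name` directly.
        def _(col):
            sc = SparkContext._active_spark_context
            # A Column is unwrapped to its JVM handle; a plain string is passed
            # through untouched, so the column name pattern only works when the
            # Scala side happens to have a String overload for this function.
            jc = getattr(sc._jvm.functions, name)(
                col._jc if isinstance(col, Column) else col)
            return Column(jc)
        _.__name__ = name
        _.__doc__ = doc
        return _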

My proposal was to solve this problem by adding the string support
where it was missing in the PySpark API. Since this is a purely
additive change, it doesn't break past code. Additionally, I find the
API sugar to be a positive feature, since code like `max("foo")` is
more concise and readable than `max(col("foo"))`. It adheres to the
DRY philosophy and is consistent with Python's preference for
readability over type protection.
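
The change itself is small - essentially normalizing the argument
before it reaches the JVM. A minimal sketch of the idea (the helper and
function names below are mine, not actual PySpark internals):

    from pyspark.sql.column import Column
    from pyspark.sql.functions import col, lower

    def _to_col(c):
        # Accept either a Column or a column name string.
        return c if isinstance(c, Column) else col(c)

    def lower_str_friendly(c):
        # Hypothetical string-accepting variant of lower().
        return lower(_to_col(c))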

However, upon submission of the PR, a discussion started about whether
it would be better to deprecate string support entirely instead - in
particular with the major 3.0 release in mind. The reasoning,
as I understood it, was that this approach is more explicit and type
safe, which is preferred in Java/Scala, plus it reduces the API
surface area - and the Python API should be consistent with the others
as well.

Upon request by @HyukjinKwon I'm submitting this matter for discussion
by this mailing list.

# Summary

There is a problem with inconsistency in the Scala/Python SQL API,
where sometimes you can use a column name string as a proxy, and
sometimes you have to use a proper Column object. To solve it there
are two approaches - to remove the string support entirely, or to add
it where it is missing. Which approach is best?

Hope this is clear.

-- André.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

Posted by Reynold Xin <rx...@databricks.com>.
I think the general philosophy here should be that Python is the most liberal and supports a Column object or a literal value. It's also super useful to support column names, but we need to decide what happens for a string: is it a literal string value, or a column name?
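
For illustration, the two possible readings of the same call (df being any DataFrame with a string column "foo"):

    from pyspark.sql import functions as F

    df.select(F.upper(F.col("foo")))   # "foo" read as a column name
    df.select(F.upper(F.lit("foo")))   # "foo" read as a literal string value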

On Mon, Mar 04, 2019 at 6:00 AM, André Mello < amello@palantir.com > wrote:

> Hey everyone,
>
> Progress has been made with PR #23882
> ( https://github.com/apache/spark/pull/23882 ), and it is now in a state
> where it could be merged with master.
>
> This is what we’re doing for now:
>
> * PySpark *will* support strings consistently throughout its API.
>   * Arguably string support makes syntax closer to SQL and Scala, where you
>     can use similar shorthands to specify columns, and the general direction
>     of the PySpark API has been to be consistent with those other two;
>   * This is a small, additive change that will not break anything;
>   * The reason support was not there in the first place was because the code
>     that generated functions was originally designed for aggregators, which
>     all support column names, but it was being used for other functions (e.g.
>     lower, abs) that did not, so it seems like it was not intentional.
>
> We are NOT going to:
>
> * Make any code changes in Scala;
>   * This requires first deciding if string support is desirable or not;
> * Decide whether or not strings should be supported in the Scala API;
>   * This requires a larger discussion and the above changes are independent
>     of this;
> * Make PySpark support Column objects where it currently only supports
>   strings (e.g. multi-argument version of drop());
>   * Converting from Column to column name is not something the API does
>     right now, so this is a stronger change;
>   * This can be considered separately.
> * Do anything with R for now.
>   * Anyone is free to take on this, but I have no experience with R.
>
> If you folks agree with this, let us know, so we can move forward with the
> merge.
>
> Best.
>
> -- André.
>
> [earlier messages in the thread trimmed]

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

Posted by André Mello <am...@palantir.com>.
Hey everyone,

Progress has been made with PR #23882 (https://github.com/apache/spark/pull/23882), and it is now in a state where it could be merged with master.

This is what we’re doing for now:

  1.  PySpark will support strings consistently throughout its API.
     *   Arguably string support makes syntax closer to SQL and Scala, where you can use similar shorthands to specify columns, and the general direction of the PySpark API has been to be consistent with those other two;
     *   This is a small, additive change that will not break anything;
     *   The reason support was not there in the first place was because the code that generated functions was originally designed for aggregators, which all support column names, but it was being used for other functions (e.g. lower, abs) that did not, so it seems like it was not intentional.
We are NOT going to:

  1.  Make any code changes in Scala;
     *   This requires first deciding if string support is desirable or not;
  2.  Decide whether or not strings should be supported in the Scala API;
     *   This requires a larger discussion and the above changes are independent of this;
  3.  Make PySpark support Column objects where it currently only supports strings (e.g. multi-argument version of drop(); a quick illustration follows this list);
     *   Converting from Column to column name is not something the API does right now, so this is a stronger change;
     *   This can be considered separately.
  4.  Do anything with R for now.
     *   Anyone is free to take on this, but I have no experience with R.
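
On point 3, a quick illustration of the current asymmetry as I understand it (df being any DataFrame with columns foo and bar):

    df.drop("foo", "bar")            # works: multiple names
    df.drop(df["foo"])               # works: a single Column
    # df.drop(df["foo"], df["bar"])  # TypeError: multiple Columns are not accepted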

If you folks agree with this, let us know, so we can move forward with the merge.

Best.

-- André.

From: Reynold Xin <rx...@databricks.com>
Date: Monday, 25 February 2019 at 00:49
To: Felix Cheung <fe...@hotmail.com>
Cc: dev <de...@spark.apache.org>, Sean Owen <sr...@gmail.com>, André Mello <as...@gmail.com>
Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions


The challenge with the Scala/Java API in the past is that when there are multiple parameters, it'd lead to an explosion of function overloads.


On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung <fe...@hotmail.com> wrote:
[earlier messages in the thread trimmed]

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

Posted by Reynold Xin <rx...@databricks.com>.
The challenge with the Scala/Java API in the past is that when there are multiple parameters, it'd lead to an explosion of function overloads.
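
That is also part of why the Python side is cheaper to keep liberal: a single function can normalize any mix of names and Columns at runtime, where Scala needs an overload per combination. A rough sketch (the helper name is illustrative, not Spark's):

    from pyspark.sql.column import Column
    from pyspark.sql.functions import col

    def _to_cols(*cols):
        # Accept any mix of column name strings and Column objects.
        return [c if isinstance(c, Column) else col(c) for c in cols]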

On Sun, Feb 24, 2019 at 3:22 PM, Felix Cheung < felixcheung_m@hotmail.com > wrote:

> 
> I hear three topics in this thread
> 
> 
> 1. I don’t think we should remove string. Column and string can both be
> “type safe”. And I would agree we don’t *need* to break API compatibility
> here.
> 
> 
> 2. Gaps in python API. Extending on #1, definitely we should be consistent
> and add string as param where it is missed.
> 
> 
> 3. Scala API for string - hard to say but make sense if nothing but for
> consistency. Though I can also see the argument of Column only in Scala.
> String might be more natural in python and much less significant in Scala
> because of $”foo” notation.
> 
> 
> (My 2 c)
> 
> [earlier quoted messages trimmed]

Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

Posted by Felix Cheung <fe...@hotmail.com>.
I hear three topics in this thread

1. I don’t think we should remove string. Column and string can both be “type safe”. And I would agree we don’t *need* to break API compatibility here.

2. Gaps in python API. Extending on #1, definitely we should be consistent and add string as param where it is missed.

3. Scala API for string - hard to say but make sense if nothing but for consistency. Though I can also see the argument of Column only in Scala. String might be more natural in python and much less significant in Scala because of $”foo” notation.

(My 2 c)


________________________________
From: Sean Owen <sr...@gmail.com>
Sent: Sunday, February 24, 2019 6:59 AM
To: André Mello
Cc: dev
Subject: Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

I just commented on the PR -- I personally don't think it's worth
removing support for, say, max("foo") over max(col("foo")) or
max($"foo") in Scala. We can make breaking changes in Spark 3 but this
seems like it would unnecessarily break a lot of code. The string arg
is more concise in Python and I can't think of cases where it's
particularly ambiguous or confusing; on the contrary it's more natural
coming from SQL.

What we do have are inconsistencies and errors in support of string vs
Column as fixed in the PR. I was surprised to see that
df.select(abs("col")) throws an error while df.select(sqrt("col"))
doesn't. I think that's easy to fix on the Python side. Really I think
the question is: do we need to add methods like "def abs(String)" and
more in Scala? that would remain inconsistent even if the Pyspark side
is fixed.

On Sun, Feb 24, 2019 at 8:54 AM André Mello <as...@gmail.com> wrote:
> [original message trimmed]

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org


Re: [DISCUSS][SQL][PySpark] Column name support for SQL functions

Posted by Sean Owen <sr...@gmail.com>.
I just commented on the PR -- I personally don't think it's worth
removing support for, say, max("foo") over max(col("foo")) or
max($"foo") in Scala. We can make breaking changes in Spark 3 but this
seems like it would unnecessarily break a lot of code. The string arg
is more concise in Python and I can't think of cases where it's
particularly ambiguous or confusing; on the contrary it's more natural
coming from SQL.

What we do have are inconsistencies and errors in support of string vs
Column as fixed in the PR. I was surprised to see that
df.select(abs("col")) throws an error while df.select(sqrt("col"))
doesn't. I think that's easy to fix on the Python side. Really I think
the question is: do we need to add methods like "def abs(String)" and
more in Scala? that would remain inconsistent even if the Pyspark side
is fixed.
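
For reference, a minimal reproduction of the inconsistency described above (before the fix in the PR; exact error aside):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(4.0,)], ["col"])

    df.select(F.sqrt("col")).show()   # works: the name is converted to a Column
    df.select(F.abs("col")).show()    # fails: the raw string is forwarded to the JVM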

On Sun, Feb 24, 2019 at 8:54 AM André Mello <as...@gmail.com> wrote:
> [original message trimmed]

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org