Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2016/11/17 11:47:58 UTC

[jira] [Updated] (SPARK-18489) Implicit type conversion during comparison between Integer type column and String type column

     [ https://issues.apache.org/jira/browse/SPARK-18489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell updated SPARK-18489:
--------------------------------------
    Description: 
Suppose I have a dataframe with schema:
{noformat}
root
 |-- _c0: integer (nullable = true)
 |-- _c1: double (nullable = true)
 |-- _c2: string (nullable = true)
{noformat}

and data:
{noformat}
+---+---+----+
|_c0|_c1| _c2|
+---+---+----+
|  1|1.0|   1|
|  2|1.0|   s|
|  3|3.1|null|
+---+---+----+
{noformat}
If the following operations are carried out:
{noformat}
df.where("_c1==_c2").show
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|1.0|  1|
+---+---+---+
{noformat}

{noformat}
df.where("_c1<>_c2").show   or   df.where("_c1!=_c2").show 
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
+---+---+---+
{noformat}
So the results of these two operations are ambiguous: row 2 (_c1 = 1.0, _c2 = s) is returned by neither the equality filter nor the inequality filter.
Here the string values that hold numbers are implicitly cast, whereas the other values are silently ignored instead of raising an exception.
In my view this can lead to incorrect results if the dataset is not inspected carefully.
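
To make the ambiguity concrete, here is a minimal sketch in plain Scala (no Spark) of the comparison semantics that would produce the results above. It assumes that the string side is cast to double, that a string which does not parse becomes NULL, and that a filter keeps a row only when the predicate is exactly TRUE. The names `ImplicitCastModel`, `castToDouble`, `eqv`, `neqv` and `where` are hypothetical and are not Spark APIs; this is a model of the observed behaviour, not Spark's implementation.

```scala
import scala.util.Try

// Plain-Scala model of SQL three-valued comparison; NOT Spark's implementation.
object ImplicitCastModel {
  // Assumed cast rule: a null or non-numeric string becomes NULL (None).
  def castToDouble(s: String): Option[Double] =
    Option(s).flatMap(v => Try(v.trim.toDouble).toOption)

  // A comparison yields NULL (None) when either operand is NULL.
  def eqv(a: Option[Double], b: Option[Double]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  def neqv(a: Option[Double], b: Option[Double]): Option[Boolean] =
    eqv(a, b).map(!_)

  // The three rows from the report: (_c0, _c1, _c2).
  val rows: Seq[(Int, Double, String)] =
    Seq((1, 1.0, "1"), (2, 1.0, "s"), (3, 3.1, null))

  // Keep a row only when the predicate is exactly TRUE; return the _c0 values.
  def where(p: (Option[Double], Option[Double]) => Option[Boolean]): Seq[Int] =
    rows.collect {
      case (id, c1, c2) if p(Some(c1), castToDouble(c2)).contains(true) => id
    }

  def main(args: Array[String]): Unit = {
    println(where(eqv))  // _c0 values kept by _c1 == _c2
    println(where(neqv)) // _c0 values kept by _c1 <> _c2
  }
}
```

Under these assumptions the equality filter keeps only row 1 and the inequality filter keeps nothing: for rows 2 and 3 the predicate is NULL in both cases, so each row is dropped twice and no error is reported.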

Also, the SQL-99 standard discourages implicit casting in order to avoid such problems.
https://users.dcc.uchile.cl/~cgutierr/cursos/BD/standards.pdf
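
As a possible workaround until the behaviour is settled, the cast can be written out explicitly and the values that do not convert can be surfaced before comparing. The following is only a sketch: the object and helper names (`ExplicitCastCheck`, `sample`, `unconvertible`) are hypothetical, and it assumes the lenient cast in which a failed string-to-double conversion yields null (the `spark.sql.ansi.enabled` setting is included only to pin that assumption on versions that recognise it).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ExplicitCastCheck {
  // Rebuilds the three-row dataset from the report.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(Int, Double, String)]((1, 1.0, "1"), (2, 1.0, "s"), (3, 3.1, null))
      .toDF("_c0", "_c1", "_c2")
  }

  // Rows whose string value is present but does not convert to a double.
  def unconvertible(df: DataFrame): DataFrame =
    df.where(col("_c2").isNotNull && col("_c2").cast("double").isNull)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-cast-check")
      // Assumption: failed casts give null instead of raising an error.
      .config("spark.sql.ansi.enabled", "false")
      .getOrCreate()

    val df = sample(spark)

    // Inspect the values that a comparison would silently drop.
    unconvertible(df).show()

    // The same comparison as in the report, with the cast made explicit.
    df.where(col("_c1") === col("_c2").cast("double")).show()

    spark.stop()
  }
}
```

Checking `unconvertible(df)` first turns the silent drop into something visible, so a non-empty result can be handled (rejected, logged or cleaned) before the comparison runs.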

The same implicit casting also applies to UDFs and aggregation functions.




> Implicit type conversion during comparison between Integer type column and String type column
> ---------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18489
>                 URL: https://issues.apache.org/jira/browse/SPARK-18489
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: Bipul Kumar


