You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:44:13 UTC

[jira] [Resolved] (SPARK-24842) self-join query fails on different letter case for same field

     [ https://issues.apache.org/jira/browse/SPARK-24842?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-24842.
----------------------------------
    Resolution: Incomplete

> self-join query fails on different letter case for same field
> -------------------------------------------------------------
>
>                 Key: SPARK-24842
>                 URL: https://issues.apache.org/jira/browse/SPARK-24842
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: CHENYANG XU
>            Priority: Major
>              Labels: bulk-closed
>
> For both hive tables and jdbc tables, self-join queries fail under specific circumstances, which may be caused by different letter case on name of same field. All caseSensitive options in sparksql keep their default values. 
> Codes to repreduce exceptions:
> For Hive:
> `deviceid` in hive table and `deviceId` in sql:  
> {code:java}
> sparkSession.sql("drop table if exists test_table")
> sparkSession.sql("create table test_table(deviceid String) partitioned by (dt String) stored as parquet")
> sparkSession.sql("insert into test_table values ('aaa', '20180708'), ('bbb', '20180708'), ('ccc', '20180708')")
> sparkSession.sql(
>       """select
>          |  from_unixtime(unix_timestamp(dt, 'yyyyMMdd'),'yyyy-MM-dd') c_dt,
>          |  deviceId
>          |from test_table
>       """.stripMargin).createOrReplaceTempView("predata")
> sparkSession.sql(
>       """select
>          |  a.deviceId
>          |from predata a left join predata b
>          |on
>          |  a.deviceId=b.deviceId
>       """.stripMargin).count()
> {code}
> {panel:title=exception1}
> java.util.NoSuchElementException: key not found: user_id
> at 
> org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$pruneSchema$1.apply(JDBCRDD.scala:84)
> ...
> {panel}
> For JDBC: 
>  First, create test table in mysql(upper-case field names):
> {code:sql}
> create table test.test_table(USER_ID int primary key, START_TIME timestamp);
> insert into test.test_table values (1, '2018-07-01'), (2, '2018-07-02'), (3, '2018-07-03');
> {code}
> Then, write spark codes:
> {code:java}
> val df = sparkSession.read.format("jdbc").options(Map("url"->"jdbc:mysql://xxx.xxx.xxx.xxx:3306", "dbtable"->"test.test_table", "user"->"bi", "password"->"bibibi")).load()
> df.select("USER_ID", "START_TIME").createOrReplaceTempView("source")
> sparkSession.sql(
>     """select
>        |  user_id
>        |  ,start_time
>        |  ,row_number() over(partition by user_id order by start_time) as px
>        |from source
>     """.stripMargin).createOrReplaceTempView("source2")
> sparkSession.sql(
>       """select
>         |  a.user_id
>         |from source2 a
>         |left join source2 b
>         |on a.user_id = b.user_id
>       """.stripMargin).count()
> {code}
> {panel:title=exception2}
> java.lang.NullPointerException
> at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14.apply(TableReader.scala:391)
> ...
> {panel}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org