You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "CHENYANG XU (JIRA)" <ji...@apache.org> on 2018/07/18 09:24:00 UTC
[jira] [Created] (SPARK-24842) self-join query fails on different letter case for same field

CHENYANG XU created SPARK-24842:
-----------------------------------

             Summary: self-join query fails on different letter case for same field
                 Key: SPARK-24842
                 URL: https://issues.apache.org/jira/browse/SPARK-24842
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.1
            Reporter: CHENYANG XU


For both hive tables and jdbc tables, self-join queries fail under specific circumstances, which may be caused by different letter case on name of same field. All caseSensitive options in sparksql keep their default values. 

Codes to repreduce exceptions:

For Hive:

`deviceid` in hive table and `deviceId` in sql:  
{code:java}
sparkSession.sql("drop table if exists test_table")
sparkSession.sql("create table test_table(deviceid String) partitioned by (dt String) stored as parquet")
sparkSession.sql("insert into test_table values ('aaa', '20180708'), ('bbb', '20180708'), ('ccc', '20180708')")
sparkSession.sql(
      """select
         |  from_unixtime(unix_timestamp(dt, 'yyyyMMdd'),'yyyy-MM-dd') c_dt,
         |  deviceId
         |from test_table
      """.stripMargin).createOrReplaceTempView("predata")
sparkSession.sql(
      """select
         |  a.deviceId
         |from predata a left join predata b
         |on
         |  a.deviceId=b.deviceId
      """.stripMargin).count()
{code}

{panel:title=exception1}
java.util.NoSuchElementException: key not found: user_id
at 
org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anonfun$pruneSchema$1.apply(JDBCRDD.scala:84)
...
{panel}

For JDBC: 
 First, create test table in mysql(upper-case field names):
{code:sql}
create table test.test_table(USER_ID int primary key, START_TIME timestamp);
insert into test.test_table values (1, '2018-07-01'), (2, '2018-07-02'), (3, '2018-07-03');
{code}
Then, write spark codes:
{code:java}
val df = sparkSession.read.format("jdbc").options(Map("url"->"jdbc:mysql://xxx.xxx.xxx.xxx:3306", "dbtable"->"test.test_table", "user"->"bi", "password"->"bibibi")).load()
df.select("USER_ID", "START_TIME").createOrReplaceTempView("source")
sparkSession.sql(
    """select
       |  user_id
       |  ,start_time
       |  ,row_number() over(partition by user_id order by start_time) as px
       |from source
    """.stripMargin).createOrReplaceTempView("source2")
sparkSession.sql(
      """select
        |  a.user_id
        |from source2 a
        |left join source2 b
        |on a.user_id = b.user_id
      """.stripMargin).count()
{code}

{panel:title=exception2}
java.lang.NullPointerException
at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$14.apply(TableReader.scala:391)
...
{panel}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org