You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Daniel Harper (JIRA)" <ji...@apache.org> on 2016/01/26 14:19:39 UTC
[jira] [Created] (SPARK-13000) Corrupted results when using LIMIT clause via JDBC connections to ThriftServer

Daniel Harper created SPARK-13000:
-------------------------------------

             Summary: Corrupted results when using LIMIT clause via JDBC connections to ThriftServer
                 Key: SPARK-13000
                 URL: https://issues.apache.org/jira/browse/SPARK-13000
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 1.5.2
         Environment: Amazon EMR AMI 4.2.0
Spark 1.5.2
            Reporter: Daniel Harper


h2. Steps to reproduce

#. Create table in HIVE  (see below for definition)
#. Insert some data (at least 2 rows) 
#. Start thrift service
#. Connect to thrift service via {{beeline}} or custom application via JDBC
#. Run query {{select * from logs_table limit 1}}

h2. Detailed description

We're seeing strange results for the following query when executed via JDBC connections to the thrift server 

{code}
select * from logs_table limit 1;
{code}

We've tried this using {{beeline}} and as you can see, the {{service}} and other columns are blank 

{code}
[hadoop@ip-x ~]$ beeline
Beeline version 1.0.0-amzn-1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10001/default
scan complete in 5ms
Connecting to jdbc:hive2://localhost:10001/default
Enter username for jdbc:hive2://localhost:10001/default:
Enter password for jdbc:hive2://localhost:10001/default:
Connected to: Spark SQL (version 1.5.2)
Driver: Hive JDBC (version 1.0.0-amzn-1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10001/default> select * from logs_table limit 1;
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |   |   |   |   |
+------------------------+----------+-------+-----+-----+--+
1 row selected (9.182 seconds)
{code}

Removing the {{LIMIT 1}} clause, we get the full dataset and all columns are present.

{code}
0: jdbc:hive2://localhost:10001/default> select * from logs_table;
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |service_1  | 2016  | 01  | 24  |
| 2016-01-24 23:29:24.0  |service_4  | 2016  | 01  | 24  |
+------------------------+----------+-------+-----+-----+--+
2 rows selected (10.956 seconds)
{code}

I ran the query {{select * from logs_table limit 1}} via 

* {{spark-sql}}
* {{spark-shell}}

...and both returned the expected results, limiting the resultset to 1 row and with all the columns populated.

This leads me to believe this is an issue with the Thrift Server or Hive JDBC driver.

We are starting the thrift server as follows:

{code}
sudo /usr/lib/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --num-executors 1 --executor-cores 5 --executor-memory 38G --conf spark.scheduler.mode=FAIR --conf spark.sql.thriftserver.scheduler.pool=default --driver-memory 10G
{code}

h2. Resources 

The HIVE table is defined as follows: 

{code}
CREATE EXTERNAL TABLE IF NOT EXISTS logs_table (
        ts STRING,
        service STRING
)
COMMENT 'logs table'
PARTITIONED BY (yyyy STRING, mm STRING, dd STRING)
STORED AS TEXTFILE
LOCATION 's3://data-lake/structured/';
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org