Posted to issues@spark.apache.org by "Daniel Harper (JIRA)" <ji...@apache.org> on 2016/01/26 14:20:39 UTC

[jira] [Updated] (SPARK-13000) Corrupted results when using LIMIT clause via JDBC connections to ThriftServer

     [ https://issues.apache.org/jira/browse/SPARK-13000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Harper updated SPARK-13000:
----------------------------------
    Description: 
h2. Steps to reproduce

# Create a table in Hive (see the definition under Resources below)
# Insert some data, at least 2 rows (a sample load is sketched just after this list)
# Start the Thrift server
# Connect to the Thrift server via {{beeline}} or a custom application over JDBC
# Run the query {{select * from logs_table limit 1}}
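
For reference, a rough sketch of how sample data can be loaded from {{spark-shell}} (Spark 1.5.2, where {{sqlContext}} is a {{HiveContext}}). The {{staging_logs}} temp table name and the partition values are only illustrative, chosen to match the rows shown further down:

{code}
// Assumes spark-shell with a HiveContext and that logs_table already exists
// (see the DDL under Resources below).
val sample = Seq(
  ("2016-01-24 23:23:24.0", "service_1"),
  ("2016-01-24 23:29:24.0", "service_4"))

// Build a DataFrame and expose it to SQL under a temporary name.
val df = sqlContext.createDataFrame(sample).toDF("ts", "service")
df.registerTempTable("staging_logs")

// Write the rows into a static partition of the external table.
sqlContext.sql(
  "INSERT INTO TABLE logs_table PARTITION (yyyy='2016', mm='01', dd='24') " +
  "SELECT ts, service FROM staging_logs")
{code}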

h2. Detailed description

We're seeing strange results for the following query when it is executed over a JDBC connection to the Thrift server:

{code}
select * from logs_table limit 1;
{code}

We've tried this using {{beeline}}, and as you can see, the {{service}} column and the partition columns come back blank:

{code}
[hadoop@ip-x ~]$ beeline
Beeline version 1.0.0-amzn-1 by Apache Hive
beeline> !connect jdbc:hive2://localhost:10001/default
scan complete in 5ms
Connecting to jdbc:hive2://localhost:10001/default
Enter username for jdbc:hive2://localhost:10001/default:
Enter password for jdbc:hive2://localhost:10001/default:
Connected to: Spark SQL (version 1.5.2)
Driver: Hive JDBC (version 1.0.0-amzn-1)
Transaction isolation: TRANSACTION_REPEATABLE_READ
0: jdbc:hive2://localhost:10001/default> select * from logs_table limit 1;
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |   |   |   |   |
+------------------------+----------+-------+-----+-----+--+
1 row selected (9.182 seconds)
{code}

With the {{LIMIT 1}} clause removed, we get the full dataset and all columns are populated:

{code}
0: jdbc:hive2://localhost:10001/default> select * from logs_table;
+------------------------+----------+-------+-----+-----+--+
|           ts           | service  | yyyy  | mm  | dd  |
+------------------------+----------+-------+-----+-----+--+
| 2016-01-24 23:23:24.0  |service_1  | 2016  | 01  | 24  |
| 2016-01-24 23:29:24.0  |service_4  | 2016  | 01  | 24  |
+------------------------+----------+-------+-----+-----+--+
2 rows selected (10.956 seconds)
{code}

I ran the query {{select * from logs_table limit 1}} via:

* {{spark-sql}}
* {{spark-shell}}

Both returned the expected result: the result set was limited to 1 row and all columns were populated.
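
The {{spark-shell}} check was essentially the following (a sketch; {{sqlContext}} is a {{HiveContext}} in {{spark-shell}} on this build):

{code}
// Same query as over JDBC, but issued directly against the HiveContext.
sqlContext.sql("select * from logs_table limit 1").show()
// Returns a single row with ts, service, yyyy, mm and dd all populated,
// unlike the result seen through the Thrift server.
{code}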

This leads me to believe the issue lies in the Thrift server or the Hive JDBC driver.
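
To rule out {{beeline}} itself, the same query can be issued from a minimal standalone JDBC client. The sketch below is hypothetical (the {{LimitRepro}} object name is made up); it assumes the same Hive JDBC driver used by {{beeline}} above is on the classpath and connects with the same URL:

{code}
import java.sql.DriverManager

object LimitRepro {
  def main(args: Array[String]): Unit = {
    // Register the Hive JDBC driver and connect to the Spark Thrift server.
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10001/default", "", "")
    val stmt = conn.createStatement()
    val rs   = stmt.executeQuery("select * from logs_table limit 1")
    while (rs.next()) {
      // With LIMIT 1 this is expected to show the same blank service column
      // seen in the beeline session above.
      println(s"ts=${rs.getString("ts")} service=${rs.getString("service")}")
    }
    rs.close(); stmt.close(); conn.close()
  }
}
{code}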

We are starting the Thrift server as follows:

{code}
sudo /usr/lib/spark/sbin/start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 --num-executors 1 --executor-cores 5 --executor-memory 38G --conf spark.scheduler.mode=FAIR --conf spark.sql.thriftserver.scheduler.pool=default --driver-memory 10G
{code}

h2. Resources 

The Hive table is defined as follows:

{code}
CREATE EXTERNAL TABLE IF NOT EXISTS logs_table (
        ts STRING,
        service STRING
)
COMMENT 'logs table'
PARTITIONED BY (yyyy STRING, mm STRING, dd STRING)
STORED AS TEXTFILE
LOCATION 's3://data-lake/structured/';
{code}

> Corrupted results when using LIMIT clause via JDBC connections to ThriftServer
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-13000
>                 URL: https://issues.apache.org/jira/browse/SPARK-13000
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2
>         Environment: Amazon EMR AMI 4.2.0
> Spark 1.5.2
>            Reporter: Daniel Harper
>


