Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/01/27 10:09:25 UTC

[jira] [Commented] (SPARK-19383) Spark SQL fails with Cassandra 3.6 and later PER PARTITION LIMIT option

    [ https://issues.apache.org/jira/browse/SPARK-19383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15842488#comment-15842488 ] 

Sean Owen commented on SPARK-19383:
-----------------------------------

This sounds like unsupported syntax. I'm not even sure "PER PARTITION LIMIT" exists in Hive?
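
For reference, PER PARTITION LIMIT is CQL-only syntax (added in Cassandra 3.6); it exists in neither HiveQL nor Spark SQL, so Catalyst's parser rejects it. The closest Spark SQL equivalent is a window function. A minimal sketch, assuming an existing SparkSession {{spark}}, the temp view named in the report, and that "most recent" means ordering by {{time_series_date}} descending:

{code:title=Spark SQL window-function equivalent (sketch)|borderStyle=solid}
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Emulate CQL's PER PARTITION LIMIT 1: keep only the newest row
// for each partition key instead of limiting rows inside Cassandra.
Dataset<Row> newestPerKey = spark.sql(
    "SELECT item_uuid, time_series_date, item_uri FROM (" +
    "  SELECT *, row_number() OVER (" +
    "      PARTITION BY item_uuid ORDER BY time_series_date DESC) AS rn" +
    "  FROM perPartitionLimitTests" +
    ") ranked WHERE rn = 1");
{code}

Unlike the CQL clause, this reads and shuffles the whole table rather than short-circuiting inside Cassandra, but it parses and yields the same one-row-per-partition result.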

> Spark SQL fails with Cassandra 3.6 and later PER PARTITION LIMIT option
> ------------------------------------------------------------------------
>
>                 Key: SPARK-19383
>                 URL: https://issues.apache.org/jira/browse/SPARK-19383
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.2
>         Environment: PER PARTITION LIMIT error documented on GitHub and reproducible by cloning: [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job]
> Java 1.8
> Cassandra version: [cqlsh 5.0.1 | Cassandra 3.9.0 | CQL spec 3.4.2 | Native protocol v4]
> {code:title=POM.xml|borderStyle=solid}
>         <dependency>
>             <groupId>com.datastax.spark</groupId>
>             <artifactId>spark-cassandra-connector_2.10</artifactId>
>             <version>2.0.0-M3</version>
>         </dependency>
>         <dependency>
>             <groupId>com.datastax.cassandra</groupId>
>             <artifactId>cassandra-driver-mapping</artifactId>
>             <version>3.1.2</version>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.hadoop</groupId>
>             <artifactId>hadoop-common</artifactId>
>             <version>2.7.2</version>
>             <scope>compile</scope>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-catalyst_2.10</artifactId>
>             <version>2.0.2</version>
>             <scope>compile</scope>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-core_2.10</artifactId>
>             <version>2.0.2</version>
>             <scope>compile</scope>
>         </dependency>
>         <dependency>
>             <groupId>org.apache.spark</groupId>
>             <artifactId>spark-sql_2.10</artifactId>
>             <version>2.0.2</version>
>             <scope>compile</scope>
>         </dependency>
> {code}
>            Reporter: Brent Dorsey
>            Priority: Minor
>              Labels: Cassandra
>
> Attempting to use version 2.0.0-M3 of the datastax/spark-cassandra-connector to select the most recent version of each partition key with the Cassandra 3.6+ PER PARTITION LIMIT option fails. I've tried all the Cassandra Java RDDs and Spark SQL, with and without partition-key equality constraints; every attempt failed with syntax errors and/or start/end bound restriction errors.
> The [BrentDorsey/cassandra-spark-job|https://github.com/BrentDorsey/cassandra-spark-job] repo contains working code that demonstrates the error: clone the repo, create the keyspace and table locally, supply connection information, then run main.
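> Until Spark SQL supports such a clause, one workaround sketch is to run the CQL directly over the connector's native session, bypassing Spark's parser entirely. The keyspace/table names below are hypothetical, and an existing SparkContext {{sc}} plus Cassandra 3.6+ are assumed:
> {code:title=native CQL workaround (sketch)|borderStyle=solid}
> import com.datastax.driver.core.ResultSet;
> import com.datastax.driver.core.Row;
> import com.datastax.driver.core.Session;
> import com.datastax.spark.connector.cql.CassandraConnector;
>
> // PER PARTITION LIMIT is valid CQL, so send it straight to Cassandra
> // instead of through Spark's SQL parser.
> CassandraConnector connector = CassandraConnector.apply(sc.getConf());
> try (Session session = connector.openSession()) {
>     ResultSet rows = session.execute(
>         "SELECT item_uuid, time_series_date, item_uri " +
>         "FROM test_keyspace.per_partition_limit_tests PER PARTITION LIMIT 1");
>     for (Row row : rows) {
>         System.out.println(row.getUUID("item_uuid"));  // one row per partition key
>     }
> }
> {code}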
> Spark Dataset .where & Spark SQL errors:
> {code:title=errors|borderStyle=solid}
> ERROR [2017-01-27 06:35:19,919] (main) org.per.partition.limit.test.spark.job.Main: getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> mismatched input 'PARTITION' expecting <EOF>(line 1, pos 67)
> == SQL ==
> TOKEN(item_uuid) > TOKEN(6616b548-4fd1-4661-a938-0af3c77357f7) PER PARTITION LIMIT 1
> -------------------------------------------------------------------^^^
> 	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
> 	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
> 	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
> 	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:43)
> 	at org.apache.spark.sql.Dataset.where(Dataset.scala:1153)
> 	at org.per.partition.limit.test.spark.job.Main.getSparkDatasetPerPartitionLimitTestWithTokenGreaterThan(Main.java:349)
> 	at org.per.partition.limit.test.spark.job.Main.run(Main.java:128)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> 	at org.per.partition.limit.test.spark.job.Main.main(Main.java:72)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> ERROR [2017-01-27 06:35:20,238] (main) org.per.partition.limit.test.spark.job.Main: getSparkSqlDatasetPerPartitionLimitTest failed.
> org.apache.spark.sql.catalyst.parser.ParseException: 
> extraneous input ''' expecting {'(', 'SELECT', 'FROM', 'ADD', 'AS', 'ALL', 'DISTINCT', 'WHERE', 'GROUP', 'BY', 'GROUPING', 'SETS', 'CUBE', 'ROLLUP', 'ORDER', 'HAVING', 'LIMIT', 'AT', 'OR', 'AND', 'IN', NOT, 'NO', 'EXISTS', 'BETWEEN', 'LIKE', RLIKE, 'IS', 'NULL', 'TRUE', 'FALSE', 'NULLS', 'ASC', 'DESC', 'FOR', 'INTERVAL', 'CASE', 'WHEN', 'THEN', 'ELSE', 'END', 'JOIN', 'CROSS', 'OUTER', 'INNER', 'LEFT', 'SEMI', 'RIGHT', 'FULL', 'NATURAL', 'ON', 'LATERAL', 'WINDOW', 'OVER', 'PARTITION', 'RANGE', 'ROWS', 'UNBOUNDED', 'PRECEDING', 'FOLLOWING', 'CURRENT', 'ROW', 'WITH', 'VALUES', 'CREATE', 'TABLE', 'VIEW', 'REPLACE', 'INSERT', 'DELETE', 'INTO', 'DESCRIBE', 'EXPLAIN', 'FORMAT', 'LOGICAL', 'CODEGEN', 'CAST', 'SHOW', 'TABLES', 'COLUMNS', 'COLUMN', 'USE', 'PARTITIONS', 'FUNCTIONS', 'DROP', 'UNION', 'EXCEPT', 'INTERSECT', 'TO', 'TABLESAMPLE', 'STRATIFY', 'ALTER', 'RENAME', 'ARRAY', 'MAP', 'STRUCT', 'COMMENT', 'SET', 'RESET', 'DATA', 'START', 'TRANSACTION', 'COMMIT', 'ROLLBACK', 'MACRO', 'IF', '+', '-', '*', 'DIV', '~', 'PERCENT', 'BUCKET', 'OUT', 'OF', 'SORT', 'CLUSTER', 'DISTRIBUTE', 'OVERWRITE', 'TRANSFORM', 'REDUCE', 'USING', 'SERDE', 'SERDEPROPERTIES', 'RECORDREADER', 'RECORDWRITER', 'DELIMITED', 'FIELDS', 'TERMINATED', 'COLLECTION', 'ITEMS', 'KEYS', 'ESCAPED', 'LINES', 'SEPARATED', 'FUNCTION', 'EXTENDED', 'REFRESH', 'CLEAR', 'CACHE', 'UNCACHE', 'LAZY', 'FORMATTED', TEMPORARY, 'OPTIONS', 'UNSET', 'TBLPROPERTIES', 'DBPROPERTIES', 'BUCKETS', 'SKEWED', 'STORED', 'DIRECTORIES', 'LOCATION', 'EXCHANGE', 'ARCHIVE', 'UNARCHIVE', 'FILEFORMAT', 'TOUCH', 'COMPACT', 'CONCATENATE', 'CHANGE', 'CASCADE', 'RESTRICT', 'CLUSTERED', 'SORTED', 'PURGE', 'INPUTFORMAT', 'OUTPUTFORMAT', DATABASE, DATABASES, 'DFS', 'TRUNCATE', 'ANALYZE', 'COMPUTE', 'LIST', 'STATISTICS', 'PARTITIONED', 'EXTERNAL', 'DEFINED', 'REVOKE', 'GRANT', 'LOCK', 'UNLOCK', 'MSCK', 'REPAIR', 'RECOVER', 'EXPORT', 'IMPORT', 'LOAD', 'ROLE', 'ROLES', 'COMPACTIONS', 'PRINCIPALS', 'TRANSACTIONS', 'INDEX', 'INDEXES', 'LOCKS', 'OPTION', 'ANTI', 'LOCAL', 'INPATH', 'CURRENT_DATE', 'CURRENT_TIMESTAMP', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, INTEGER_VALUE, DECIMAL_VALUE, SCIENTIFIC_DECIMAL_VALUE, DOUBLE_LITERAL, BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 36)
> == SQL ==
> SELECT item_uuid, time_series_date, 'item_uri FROM perPartitionLimitTests PER PARTITION LIMIT 1
> ------------------------------------^^^
> 	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
> 	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
> 	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:45)
> 	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
> 	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
> 	at org.per.partition.limit.test.spark.job.Main.getSparkSqlDatasetPerPartitionLimitTest(Main.java:367)
> 	at org.per.partition.limit.test.spark.job.Main.run(Main.java:129)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> 	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
> 	at org.per.partition.limit.test.spark.job.Main.main(Main.java:72)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:497)
> 	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> {code}
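> Note that the second stack trace trips first on the stray single quote before {{item_uri}} (pos 36). Removing the quote does not rescue the query: as the first stack trace shows, the parser stops at {{PARTITION}} regardless. A minimal repro with the quote fixed, assuming the same temp view and SparkSession {{spark}}:
> {code:title=still fails without the stray quote (sketch)|borderStyle=solid}
> // Still throws ParseException on Spark 2.0.2: Catalyst has no
> // PER PARTITION LIMIT clause (see the first stack trace above).
> spark.sql("SELECT item_uuid, time_series_date, item_uri " +
>           "FROM perPartitionLimitTests PER PARTITION LIMIT 1");
> {code}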


