Posted to issues@spark.apache.org by "Marco Gaido (JIRA)" <ji...@apache.org> on 2018/02/09 12:08:00 UTC

[jira] [Commented] (SPARK-23373) Can not execute "count distinct" queries on parquet formatted table

    [ https://issues.apache.org/jira/browse/SPARK-23373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16358315#comment-16358315 ] 

Marco Gaido commented on SPARK-23373:
-------------------------------------

I cannot reproduce this on current master... Could you try again and check whether the issue still exists?
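For reference, a minimal reproduction attempt along these lines does not hit the exception for me on master (table and column names below are placeholders mirroring the report, not the reporter's actual data):

{code:scala}
// Reproduction sketch for spark-shell or a small Spark application.
// Writes a tiny Parquet table and runs the "count distinct" query from the report.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-23373-repro")
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._

// Placeholder data standing in for the TPC-H "nation" table.
Seq((1, "ALGERIA"), (2, "BRAZIL"), (3, "CANADA"))
  .toDF("n_nationkey", "n_name")
  .write.mode("overwrite").format("parquet")
  .saveAsTable("nation_parquet")

// The failing query from the report; here it simply returns 3
// rather than throwing "Task not serializable".
spark.sql("select count(distinct n_name) from nation_parquet").show()
{code}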

> Can not execute "count distinct" queries on parquet formatted table
> -------------------------------------------------------------------
>
>                 Key: SPARK-23373
>                 URL: https://issues.apache.org/jira/browse/SPARK-23373
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: Wang, Gang
>            Priority: Major
>
> I failed to run the SQL "select count(distinct n_name) from nation"; the table nation is stored in Parquet format. The error trace is as follows.
> _spark-sql> select count(distinct n_name) from nation;_
>  _18/02/09 03:55:28 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_
>  _Error in query: Table or view not found: nation; line 1 pos 35_
>  _spark-sql> select count(distinct n_name) from nation_parquet;_
>  _18/02/09 03:55:36 INFO main SparkSqlParser:54 Parsing command: select count(distinct n_name) from nation_parquet_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: int_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: string_
>  _18/02/09 03:55:36 INFO main CatalystSqlParser:54 Parsing command: array<string>_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Pruning directories with:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Data Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Post-Scan Filters:_
>  _18/02/09 03:55:38 INFO main FileSourceStrategy:54 Output Data Schema: struct<n_name: string>_
>  _18/02/09 03:55:38 INFO main FileSourceScanExec:54 Pushed Filters:_
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 295.88685 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 51.075394 ms_
>  _18/02/09 03:55:39 INFO main HashAggregateExec:54 spark.sql.codegen.aggregate.map.twolevel.enable is set to true, but current version of codegened fast hashmap does not support this aggregate._
>  _18/02/09 03:55:39 INFO main CodeGenerator:54 Code generated in 42.819226 ms_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 parquetFilterPushDown is true_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 start filter class_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 Pushed not defined_
>  _18/02/09 03:55:39 INFO main ParquetFileFormat:54 end filter class_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0 stored as values in memory (estimated size 305.0 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO main MemoryStore:54 Block broadcast_0_piece0 stored as bytes in memory (estimated size 27.6 KB, free 366.0 MB)_
>  _18/02/09 03:55:39 INFO dispatcher-event-loop-7 BlockManagerInfo:54 Added broadcast_0_piece0 in memory on 10.64.205.170:45616 (size: 27.6 KB, free: 366.3 MB)_
>  _18/02/09 03:55:39 INFO main SparkContext:54 Created broadcast 0 from processCmd at CliDriver.java:376_
>  _18/02/09 03:55:39 INFO main InMemoryFileIndex:54 Selected files after partition pruning:_
>  _PartitionDirectory([empty row],ArrayBuffer(LocatedFileStatus{path=hdfs://******************.com:8020/apps/hive/warehouse/nation_parquet/000000_0; isDirectory=false; length=3216; replication=3; blocksize=134217728; modification_time=1516619879024; access_time=0; owner=; group=; permission=rw-rw-rw-; isSymlink=false}))_
>  _18/02/09 03:55:39 INFO main FileSourceScanExec:54 Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes._
>  _18/02/09 03:55:39 ERROR main SparkSQLDriver:91 Failed in [select count(distinct n_name) from nation_parquet]_
>  {color:#ff0000}*_org.apache.spark.SparkException: Task not serializable_*{color}
>  _at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
>  _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
>  _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
>  _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)_
>  _at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)_
>  _at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:840)_
>  _at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:389)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_
>  _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)_
>  _at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)_
>  _at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)_
>  _at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)_
>  _at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)_
>  _at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)_
>  _at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_
>  _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)_
>  _at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)_
>  _at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:252)_
>  _at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)_
>  _at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)_
>  _at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_
>  _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)_
>  _at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)_
>  _at org.apache.spark.sql.execution.exchange.ShuffleExchange.prepareShuffleDependency(ShuffleExchange.scala:88)_
>  _at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:124)_
>  _at org.apache.spark.sql.execution.exchange.ShuffleExchange$$anonfun$doExecute$1.apply(ShuffleExchange.scala:115)_
>  _at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)_
>  _at org.apache.spark.sql.execution.exchange.ShuffleExchange.doExecute(ShuffleExchange.scala:115)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_
>  _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)_
>  _at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)_
>  _at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:252)_
>  _at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:141)_
>  _at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:386)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)_
>  _at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)_
>  _at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)_
>  _at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)_
>  _at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)_
>  _at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:228)_
>  _at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:275)_
>  _at org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:298)_
>  _at org.apache.spark.sql.execution.QueryExecution.hiveResultString(QueryExecution.scala:133)_
>  _at org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)_
>  _at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:340)_
>  _at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:376)_
>  _at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:248)_
>  _at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)_
>  _at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)_
>  _at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)_
>  _at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)_
>  _at java.lang.reflect.Method.invoke(Method.java:498)_
>  _at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)_
>  _at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)_
>  _at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)_
>  _at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)_
>  _at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)_
>  _Caused by: java.io.NotSerializableException: scala.concurrent.impl.ExecutionContextImpl$$anon$1_
>  _Serialization stack:_
>  _- object not serializable (class: scala.concurrent.impl.ExecutionContextImpl$$anon$1, value: scala.concurrent.impl.ExecutionContextImpl$$anon$1@149e457)_
>  _- field (class: org.apache.spark.sql.execution.FileSourceScanExec, name: org$apache$spark$sql$execution$FileSourceScanExec$$executionContext, type: interface scala.concurrent.ExecutionContextExecutorService)_
>  _- object (class org.apache.spark.sql.execution.FileSourceScanExec, FileScan parquet default.nation_parquet[n_name#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://******************.com:8020/apps/hive/warehouse/nation_par..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<n_name:string>, UsedIndexes: []_
>  _)_
>  _- field (class: org.apache.spark.sql.execution.aggregate.HashAggregateExec, name: child, type: class org.apache.spark.sql.execution.SparkPlan)_
>  _- object (class org.apache.spark.sql.execution.aggregate.HashAggregateExec, HashAggregate(keys=[n_name#1], functions=[], output=[n_name#1])_
>  _+- FileScan parquet default.nation_parquet[n_name#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://******************.com:8020/apps/hive/warehouse/nation_par..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<n_name:string>, UsedIndexes: []_
>  _)_
>  _- element of array (index: 0)_
>  _- array (class [Ljava.lang.Object;, size 7)_
>  _- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, name: references$1, type: class [Ljava.lang.Object;)_
>  _- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8, <function2>)_
>  _at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)_
>  _at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)_
>  _at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)_
>  _at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:337)_
>  _... 75 more_
>  _org.apache.spark.SparkException: Task not serializable_
>  _at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)_
>  _at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)_
>  _at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)_
>  _at org.apache.spark.SparkContext.clean(SparkContext.scala:2294)_
>  _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:841)_
>  _at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:840)_
>  _at org.apache.spark.rdd.RDDOperationScope$.withS..._
>  
> I tried the same query on a table stored as text, and it worked fine.
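The cause chain in the trace above points at a non-serializable ExecutionContext held by the FileSourceScanExec node that gets captured in the whole-stage codegen closure (java.io.NotSerializableException: scala.concurrent.impl.ExecutionContextImpl...). The usual way to keep such a field out of a task closure is to mark it @transient (and typically lazy, so it is re-created on whichever JVM actually uses it). The sketch below is only a generic illustration of that pattern with a made-up class; it is not the actual Spark source or the fix that went into Spark:

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Hypothetical scan-like node, used only to illustrate the serialization
// pattern; it is not the real FileSourceScanExec.
class ScanLikeNode(val path: String) extends Serializable {

  // A plain (non-transient) field of this type would be dragged into the
  // serialized task closure and fail exactly like the trace above.
  // @transient lazy keeps it out of serialization and rebuilds it lazily
  // where it is first touched.
  @transient private lazy val executionContext: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  def listFilesAsync(): Unit =
    executionContext.execute(new Runnable {
      override def run(): Unit = println(s"listing $path")
    })
}
{code}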


