Posted to issues@spark.apache.org by "Michael Allman (JIRA)" <ji...@apache.org> on 2016/10/18 17:13:58 UTC

[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

    [ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586018#comment-15586018 ] 

Michael Allman commented on SPARK-17983:
----------------------------------------

cc [~rxin]

I had a feeling there might be some fallout like this. It seems we really need to reconcile Hive metastore column names to on-disk column names as part of planning.
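For concreteness, here is a minimal sketch of what that reconciliation could look like: each metastore column name (which the metastore stores lower-cased) is matched case-insensitively against the field names recorded in the parquet footers, and the on-disk spelling wins. The method name and its placement in planning are illustrative only, not actual Spark internals.

{code}
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper: rewrite the metastore schema so each column uses the
// exact-case spelling found in the parquet file footers. Partition columns
// have no on-disk data, so they fall through unchanged.
def reconcileColumnNames(
    metastoreSchema: StructType,
    parquetSchema: StructType): StructType = {
  // Parquet field names keyed by their lower-cased form.
  val onDiskNames = parquetSchema.fields.map(f => f.name.toLowerCase -> f.name).toMap
  StructType(metastoreSchema.fields.map { field =>
    onDiskNames.get(field.name.toLowerCase) match {
      case Some(diskName) => field.copy(name = diskName)
      case None => field
    }
  })
}
{code}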

I think I mentioned this before, and I have in fact implemented this kind of reconciliation. It occurs after partition pruning in optimization, so it only involves the partitions in the query plan. Obviously, this is a big improvement over the original behavior, which did a scan over every data file in the table.

Even though this adds some cost to query planning, I believe it can be restricted to the first access of a partition in a given Spark session. The straightforward solution would be to cache the table metadata incrementally as partitions are scanned. Subsequent requests for partition schemas and metadata would be served from the cache, which would be invalidated through the usual methods.
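As a rough sketch of that incremental cache, assuming a per-session map keyed by partition path (the class and its API are hypothetical, not an existing Spark component):

{code}
import scala.collection.concurrent.TrieMap
import org.apache.spark.sql.types.StructType

// Hypothetical per-session cache: partition schemas are read from file
// footers on first access and memoized thereafter. Invalidation (e.g. on
// REFRESH TABLE) simply clears the map.
class PartitionMetadataCache(loadFromFooter: String => StructType) {
  private val cache = TrieMap.empty[String, StructType]

  // First access pays the footer read; later requests in the same session
  // are served from the cache.
  def schemaFor(partitionPath: String): StructType =
    cache.getOrElseUpdate(partitionPath, loadFromFooter(partitionPath))

  def invalidate(): Unit = cache.clear()
}
{code}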

This is along the lines of the "re-add partition caching" task I mentioned at the beginning of https://github.com/apache/spark/pull/14690.

Thoughts?

> Can't filter over mixed case parquet columns of converted Hive tables
> ---------------------------------------------------------------------
>
>                 Key: SPARK-17983
>                 URL: https://issues.apache.org/jira/browse/SPARK-17983
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Eric Liang
>            Priority: Critical
>
> We should probably revive https://github.com/apache/spark/pull/14750 in order to fix this issue and related classes of issues.
> The only other alternatives are (1) reconciling on-disk schemas with the metastore schema at planning time, which seems pretty messy, and (2) fixing all the datasources to support case-insensitive matching, which also has issues.
> Reproduction:
> {code}
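>   // Note: this runs inside a Spark SQL test suite; withTable and withTempDir
>   // are test helpers (see Spark's SQLTestUtils).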
>   private def setupPartitionedTable(tableName: String, dir: File): Unit = {
>     spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as partCol2").write
>       .partitionBy("partCol1", "partCol2")
>       .mode("overwrite")
>       .parquet(dir.getAbsolutePath)
>     spark.sql(s"""
>       |create external table $tableName (normalCol long)
>       |partitioned by (partCol1 int, partCol2 int)
>       |stored as parquet
>       |location "${dir.getAbsolutePath}"""".stripMargin)
>     spark.sql(s"msck repair table $tableName")
>   }
>   test("filter by mixed case col") {
>     withTable("test") {
>       withTempDir { dir =>
>         setupPartitionedTable("test", dir)
>         val df = spark.sql("select * from test where normalCol = 3")
>         assert(df.count() == 1)
>       }
>     }
>   }
> {code}
> cc [~cloud_fan]


