Posted to issues@spark.apache.org by "Adam Borochoff (JIRA)" <ji...@apache.org> on 2015/07/28 01:16:04 UTC

[jira] [Commented] (SPARK-5151) Parquet Predicate Pushdown Does Not Work with Nested Structures.

    [ https://issues.apache.org/jira/browse/SPARK-5151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14643588#comment-14643588 ] 

Adam Borochoff commented on SPARK-5151:
---------------------------------------

Hey guys,

Does this get fixed in Spark 1.5 when you upgrade the Parquet version?

Parquet bug fix:
https://issues.apache.org/jira/browse/PARQUET-116

Spark parquet upgrade:
https://issues.apache.org/jira/browse/SPARK-7743
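
For anyone who needs a workaround until that upgrade lands, filter pushdown can be switched off at the session level. A minimal sketch, assuming a PySpark SQLContext like the one in the report below; spark.sql.parquet.filterPushdown is the Spark 1.2+ config key that gates this feature:

```python
# Workaround sketch (assumes an existing SQLContext named sql_context):
# disable Parquet filter pushdown so that filters on nested columns such as
# meta_data.user are evaluated by Spark instead of being pushed down to Parquet.
sql_context.setConf("spark.sql.parquet.filterPushdown", "false")
```

With the setting off, the failing query in the report runs, at the cost of Parquet no longer skipping row groups for you.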


> Parquet Predicate Pushdown Does Not Work with Nested Structures.
> ----------------------------------------------------------------
>
>                 Key: SPARK-5151
>                 URL: https://issues.apache.org/jira/browse/SPARK-5151
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0
>         Environment: pyspark, spark-ec2 created cluster
>            Reporter: Brad Willard
>              Labels: parquet, pyspark, sql
>
> I have JSON files of objects created with a nested structure, roughly of the form:
> { "id": 123, "event": "login", "meta_data": {"user": "user1"}}
> ....
> { "id": 125, "event": "login", "meta_data": {"user": "user2"}}
> I load the data via Spark with:
> rdd = sql_context.jsonFile()
> # save it as a Parquet file, then reload it
> rdd.saveAsParquetFile()
> rdd = sql_context.parquetFile()
> rdd.registerTempTable('events')
> So this query works without issue when predicate pushdown is disabled:
> select count(1) from events where meta_data.user = "user1"
> If I enable predicate pushdown, I get an error saying meta_data.user is not in the schema:
> Py4JJavaError: An error occurred while calling o218.collect.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 125 in stage 12.0 failed 4 times, most recent failure: Lost task 125.3 in stage 12.0 (TID 6164, ): java.lang.IllegalArgumentException: Column [user] was not found in schema!
> 	at parquet.Preconditions.checkArgument(Preconditions.java:47)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
> 	at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
> 	at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
> .....
> I expect this is actually related to another bug I filed where nested structure is not preserved with spark sql.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org