You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Brad Willard (JIRA)" <ji...@apache.org> on 2015/06/05 23:25:01 UTC
[jira] [Created] (SPARK-8128) Dataframe Fails to Recognize Column
in Schema
Brad Willard created SPARK-8128:
-----------------------------------
Summary: Dataframe Fails to Recognize Column in Schema
Key: SPARK-8128
URL: https://issues.apache.org/jira/browse/SPARK-8128
Project: Spark
Issue Type: Bug
Components: PySpark, Spark Core
Affects Versions: 1.3.1
Reporter: Brad Willard
I'm loading a folder of parquet files with about 600 parquet files and loading it into one dataframe so schema merging is involved. There is some bug with the schema merging that you print the schema and it shows and attributes. However when you run a query and filter on that attribute is errors saying it's not in the schema.
I think this bug could be related to an attribute name being reused in a nested object. "mediaProcessingState" appears twice in the schema and is the problem.
sdf = sql_context.parquet('/parquet/big_data_folder')
sdf.printSchema()
root
|-- _id: string (nullable = true)
|-- addedOn: string (nullable = true)
|-- attachment: string (nullable = true)
.......
|-- items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _id: string (nullable = true)
| | |-- addedOn: string (nullable = true)
| | |-- authorId: string (nullable = true)
| | |-- mediaProcessingState: long (nullable = true)
|-- mediaProcessingState: long (nullable = true)
|-- title: string (nullable = true)
|-- key: string (nullable = true)
sdf.filter(sdf.mediaProcessingState == 3).count()
causes this exception
Py4JJavaError: An error occurred while calling o67.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1106 in stage 4.0 failed 30 times, most recent failure: Lost task 1106.29 in stage 4.0 (TID 70565, XXXXXXXXXXXXXXX): java.lang.IllegalArgumentException: Column [mediaProcessingState] was not found in schema!
at parquet.Preconditions.checkArgument(Preconditions.java:47)
at parquet.filter2.predicate.SchemaCompatibilityValidator.getColumnDescriptor(SchemaCompatibilityValidator.java:172)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumn(SchemaCompatibilityValidator.java:160)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validateColumnFilterPredicate(SchemaCompatibilityValidator.java:142)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:76)
at parquet.filter2.predicate.SchemaCompatibilityValidator.visit(SchemaCompatibilityValidator.java:41)
at parquet.filter2.predicate.Operators$Eq.accept(Operators.java:162)
at parquet.filter2.predicate.SchemaCompatibilityValidator.validate(SchemaCompatibilityValidator.java:46)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:41)
at parquet.filter2.compat.RowGroupFilter.visit(RowGroupFilter.java:22)
at parquet.filter2.compat.FilterCompat$FilterPredicateCompat.accept(FilterCompat.java:108)
at parquet.filter2.compat.RowGroupFilter.filterRowGroups(RowGroupFilter.java:28)
at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:158)
at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:138)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:133)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:104)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:66)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org