Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2016/08/31 23:51:20 UTC

[jira] [Commented] (SPARK-17341) Can't read Parquet data with fields containing periods "."

    [ https://issues.apache.org/jira/browse/SPARK-17341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15453767#comment-15453767 ] 

Hyukjin Kwon commented on SPARK-17341:
--------------------------------------

Is this a duplicate of SPARK-16698? I believe this does not happen in the current master. Could you please confirm?

> Can't read Parquet data with fields containing periods "."
> ----------------------------------------------------------
>
>                 Key: SPARK-17341
>                 URL: https://issues.apache.org/jira/browse/SPARK-17341
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Don Drake
>
> I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have encountered a showstopper problem with Parquet datasets that have fields containing a "." in the field name.  This data comes from an external provider (CSV) and we just pass through the field names.  This has worked flawlessly in Spark 1.5 and 1.6, but now Spark can't seem to read these Parquet files (a possible workaround sketch follows the transcript below).
> {code}
> Spark context available as 'sc' (master = local[*], app id = local-1472664486578).
> Spark session available as 'spark'.
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>       /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "squared.value")
> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
> scala> squaresDF.take(2)
> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
> scala> squaresDF.write.parquet("squares")
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory
> Scaling row group sizes to 96.54% for 7 writers
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory
> Scaling row group sizes to 84.47% for 8 writers
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 16
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 16
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 16
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 16
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 16
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages, encodings: [BIT_PACKED, PLAIN]
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages, encodings: [BIT_PACKED, PLAIN]
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages, encodings: [BIT_PACKED, PLAIN]
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages, encodings: [BIT_PACKED, PLAIN]
> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT
> scala> val inSquare = spark.read.parquet("squares")
> inSquare: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]
> scala> inSquare.take(2)
> org.apache.spark.sql.AnalysisException: Unable to resolve squared.value given [value, squared.value];
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>   at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
>   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2558)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
>   ... 48 elided
> scala>
> {code}
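> A minimal workaround sketch, not a fix: assuming downstream consumers can accept underscores instead of dots, renaming the columns before writing keeps dotted field names out of the Parquet schema entirely. The column names come from the transcript above; the output path "squares_sanitized" is hypothetical.
> {code}
> // Hypothetical workaround: replace "." with "_" in every column name before
> // writing, so the written Parquet schema contains no dotted field names.
> val sanitized = squaresDF.toDF(squaresDF.columns.map(_.replace(".", "_")): _*)
> sanitized.write.parquet("squares_sanitized")
> // Reading back and collecting then works, since no field name contains ".".
> spark.read.parquet("squares_sanitized").take(2)
> {code}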



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org