Posted to issues@spark.apache.org by "Don Drake (JIRA)" <ji...@apache.org> on 2016/08/31 17:48:20 UTC

[jira] [Created] (SPARK-17341) Can't read Parquet data with fields containing periods "."

Don Drake created SPARK-17341:
---------------------------------

             Summary: Can't read Parquet data with fields containing periods "."
                 Key: SPARK-17341
                 URL: https://issues.apache.org/jira/browse/SPARK-17341
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Don Drake


I am porting a set of Spark 1.6.2 applications to Spark 2.0 and I have encountered a showstopper problem with Parquet datasets that have fields containing a "." in the field name.  This data comes from an external provider (CSV) and we simply pass the field names through.  This worked flawlessly in Spark 1.5 and 1.6, but Spark 2.0 can no longer read these Parquet files back.
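
The ambiguity at play, as I understand it (a minimal sketch, not from the failing job, assuming a fresh Spark 2.0 spark-shell with the usual implicits in scope): an unquoted "." in a column reference is parsed as a struct-field accessor, so a literal dot in a name has to be backtick-quoted.

{code}
// Sketch of the naming semantics, not from the failing job:
val df = Seq((1, 1)).toDF("value", "squared.value")

// Backticks force "squared.value" to be read as a single column name,
// so this resolves fine against the in-memory DataFrame:
df.select("`squared.value`").show()

// Without backticks the reference is split on ".", i.e. column "squared"
// with nested field "value", which does not exist here:
// df.select("squared.value")   // AnalysisException
{code}

The transcript below shows that the in-memory DataFrame and the Parquet write are both fine; it is the read-back that fails.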

{code}
Spark context available as 'sc' (master = local[*], app id = local-1472664486578).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_51)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "squared.value")
16/08/31 12:28:44 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/08/31 12:28:44 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]

scala> squaresDF.take(2)
res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])

scala> squaresDF.write.parquet("squares")
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
[... identical CodecConfig/ParquetOutputFormat INFO lines from the other parallel writers elided ...]
Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory
Scaling row group sizes to 96.54% for 7 writers
Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager: Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory
Scaling row group sizes to 84.47% for 8 writers
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 0
Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages, encodings: [BIT_PACKED, PLAIN]
[... repeated Flushing/written lines from the remaining writers elided ...]
scala> val inSquare = spark.read.parquet("squares")
inSquare: org.apache.spark.sql.DataFrame = [value: int, squared.value: int]

scala> inSquare.take(2)
org.apache.spark.sql.AnalysisException: Unable to resolve squared.value given [value, squared.value];
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
  at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
  at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
  at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
  at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
  at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
  at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2558)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
  ... 48 elided

scala>
{code}
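
One possible workaround, sketched here and untested against this exact case, assuming renamed pass-through fields are acceptable downstream: replace the "." in each column name before writing, since the write itself succeeds and only the read-back fails.

{code}
// Hypothetical workaround sketch: rename dotted columns to use "_"
// before the Parquet write, so the reader never sees a "." in a field name.
val sanitized = squaresDF.columns.foldLeft(squaresDF) { (df, name) =>
  df.withColumnRenamed(name, name.replace(".", "_"))
}
sanitized.write.parquet("squares_sanitized")
spark.read.parquet("squares_sanitized").take(2)
{code}

That defeats the point of passing field names through unchanged, though, so I still consider the read-back failure a blocker.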


