Posted to user@spark.apache.org by Don Drake <do...@gmail.com> on 2016/09/01 00:05:05 UTC

Re: Spark 2.0 - Parquet data with fields containing periods "."

Yes, I just tested it against the nightly build from 8/31.  Looking at the
PR, I'm happy that the test it adds covers my issue.
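
For anyone else who wants to verify against a nightly, the round trip from
the repro below boils down to:

  val squaresDF = spark.sparkContext.makeRDD(1 to 5)
    .map(i => (i, i * i))
    .toDF("value", "squared.value")   // column name containing a "."
  squaresDF.write.parquet("squares")
  spark.read.parquet("squares").take(2)   // AnalysisException on 2.0.0; rows come back on the fixed build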

Thanks.

-Don

On Wed, Aug 31, 2016 at 6:49 PM, Hyukjin Kwon <gu...@gmail.com> wrote:

> Hi Don, I guess this should be fixed as of 2.0.1.
>
> Please refer to this PR: https://github.com/apache/spark/pull/14339
>
> On 1 Sep 2016 2:48 a.m., "Don Drake" <do...@gmail.com> wrote:
>
>> I am in the process of migrating a set of Spark 1.6.2 ETL jobs to Spark
>> 2.0 and have encountered some interesting issues.
>>
>> First, it seems the SQL parsing is different, and I had to rewrite some
>> SQL that mixed inner joins (written in the WHERE clause rather than with
>> explicit INNER JOIN syntax) and outer joins before it would run; the
>> errors complained about columns not existing.  I can't reproduce that one
>> easily and can't share the SQL.  Just curious if anyone else is seeing this?
>>
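>> To give a flavor of the change (the table and column names below are made
>> up for illustration, not from the real job), the rewrite was roughly from
>> joins expressed in the WHERE clause:
>>
>>   spark.sql("""SELECT a.id, b.total
>>                FROM orders a, order_totals b
>>                WHERE a.id = b.order_id""")
>>
>> to explicit JOIN ... ON syntax:
>>
>>   spark.sql("""SELECT a.id, b.total
>>                FROM orders a
>>                JOIN order_totals b ON a.id = b.order_id""")
>>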
>> I do have a showstopper problem with Parquet datasets that have fields
>> containing a "." in the field name.  This data comes from an external
>> provider (CSV) and we just pass the field names through.  This has worked
>> flawlessly in Spark 1.5 and 1.6, but now Spark can't seem to read these
>> Parquet files.
>>
>> I've reproduced a trivial example below. Jira created:
>> https://issues.apache.org/jira/browse/SPARK-17341
>>
>>
>> Spark context available as 'sc' (master = local[*], app id =
>> local-1472664486578).
>> Spark session available as 'spark'.
>> Welcome to
>>       ____              __
>>      / __/__  ___ _____/ /__
>>     _\ \/ _ \/ _ `/ __/  '_/
>>    /___/ .__/\_,_/_/ /_/\_\   version 2.0.0
>>       /_/
>>
>> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java
>> 1.7.0_51)
>> Type in expressions to have them evaluated.
>> Type :help for more information.
>>
>> scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i
>> * i)).toDF("value", "squared.value")
>> 16/08/31 12:28:44 WARN ObjectStore: Version information not found in
>> metastore. hive.metastore.schema.verification is not enabled so
>> recording the schema version 1.2.0
>> 16/08/31 12:28:44 WARN ObjectStore: Failed to get database default,
>> returning NoSuchObjectException
>> squaresDF: org.apache.spark.sql.DataFrame = [value: int, squared.value:
>> int]
>>
>> scala> squaresDF.take(2)
>> res0: Array[org.apache.spark.sql.Row] = Array([1,1], [2,4])
>>
>> scala> squaresDF.write.parquet("squares")
>> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
>> SLF4J: Defaulting to no-operation (NOP) logger implementation
>> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for
>> further details.
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.codec.CodecConfig:
>> Compression: SNAPPY
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet block size to 134217728
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet page size to 1048576
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Parquet dictionary page size to 1048576
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Dictionary is on
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Validation is off
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ParquetOutputFormat:
>> Writer version is: PARQUET_1_0
>> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager:
>> Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory
>> Scaling row group sizes to 96.54% for 7 writers
>> Aug 31, 2016 12:29:08 PM WARNING: org.apache.parquet.hadoop.MemoryManager:
>> Total allocation exceeds 95.00% (906,992,000 bytes) of heap memory
>> Scaling row group sizes to 84.47% for 8 writers
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.InternalParquetRecordWriter:
>> Flushing mem columnStore to file. allocated memory: 0
>> Aug 31, 2016 12:29:08 PM INFO: org.apache.parquet.hadoop.ColumnChunkPageWriteStore:
>> written 39B for [value] INT32: 1 values, 4B raw, 6B comp, 1 pages,
>> encodings: [BIT_PACKED, PLAIN]
>> [... identical Parquet INFO lines for the remaining writers snipped ...]
>> scala> val inSquare = spark.read.parquet("squares")
>> inSquare: org.apache.spark.sql.DataFrame = [value: int, squared.value:
>> int]
>>
>> scala> inSquare.take(2)
>> org.apache.spark.sql.AnalysisException: Unable to resolve squared.value given [value, squared.value];
>>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1$$anonfun$apply$5.apply(LogicalPlan.scala:134)
>>   at scala.Option.getOrElse(Option.scala:121)
>>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:133)
>>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:129)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>>   at org.apache.spark.sql.types.StructType.foreach(StructType.scala:95)
>>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>>   at org.apache.spark.sql.types.StructType.map(StructType.scala:95)
>>   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:129)
>>   at org.apache.spark.sql.execution.datasources.FileSourceStrategy$.apply(FileSourceStrategy.scala:87)
>>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>>   at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:60)
>>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>>   at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:61)
>>   at org.apache.spark.sql.execution.SparkPlanner.plan(SparkPlanner.scala:47)
>>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:51)
>>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1$$anonfun$apply$1.applyOrElse(SparkPlanner.scala:48)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:301)
>>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:300)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:298)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:321)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:179)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:319)
>>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:298)
>>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>>   at org.apache.spark.sql.execution.SparkPlanner$$anonfun$plan$1.apply(SparkPlanner.scala:48)
>>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:78)
>>   at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:76)
>>   at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:83)
>>   at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:83)
>>   at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2558)
>>   at org.apache.spark.sql.Dataset.head(Dataset.scala:1924)
>>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2139)
>>   ... 48 elided
>>
>> scala>
>>
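>> In case it helps anyone else hitting this, one way to sidestep the problem
>> for newly written data (a sketch only, not something I have verified
>> against 2.0.0) would be to rename the columns so they no longer contain
>> dots before the write, e.g.:
>>
>>   // replace "." with "_" in every column name; toDF renames positionally,
>>   // so the dotted names never have to be resolved ("squares_underscored"
>>   // is just an example path)
>>   val sanitized = squaresDF.toDF(squaresDF.columns.map(_.replace(".", "_")): _*)
>>   sanitized.write.parquet("squares_underscored")
>>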
>> --
>> Donald Drake
>> Drake Consulting
>> http://www.drakeconsulting.com/
>> https://twitter.com/dondrake
>> 800-733-2143
>>
>


-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
https://twitter.com/dondrake
800-733-2143