Posted to issues@spark.apache.org by "Vishal Donderia (JIRA)" <ji...@apache.org> on 2019/07/30 11:58:00 UTC
[jira] [Created] (SPARK-28563) Spark 2.4 | Reading all the data inside partition like directory.
Vishal Donderia created SPARK-28563:
---------------------------------------
Summary: Spark 2.4 | Reading all the data inside partition like directory.
Key: SPARK-28563
URL: https://issues.apache.org/jira/browse/SPARK-28563
Project: Spark
Issue Type: Bug
Components: Input/Output
Affects Versions: 2.4.1
Reporter: Vishal Donderia
We have upgraded our cluster from Spark 2.3 to 2.4 and are now observing different behavior when reading data.
In Spark 2.3
spark.read.option('basePath','output/model').orc('output/model/abc=4')
Expected: the "abc" column appears in the schema.
Similarly:
spark.read.option('basePath','output/model/abc=4').orc('output/model/abc=4')
Expected: only the data inside partition abc=4 is read, and "abc" is not part of the schema, even if the files elsewhere under "output/model" have a different schema.
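The two expectations above can be illustrated with a small pure-Python model of how partition columns relate to the base path. This is only a sketch of the idea, not Spark's implementation: the helper name partition_columns is hypothetical, and it assumes that only the name=value directory segments below the base path become partition columns.

```python
from pathlib import PurePosixPath


def partition_columns(base_path, leaf_dir):
    """Return {column: value} for name=value segments of leaf_dir below base_path.

    Hypothetical helper: a simplified model, not Spark's PartitioningUtils.
    """
    base = PurePosixPath(base_path)
    leaf = PurePosixPath(leaf_dir)
    # relative_to raises ValueError if leaf_dir is not under base_path.
    relative = leaf.relative_to(base)
    columns = {}
    for segment in relative.parts:
        if "=" in segment:
            name, value = segment.split("=", 1)
            columns[name] = value
    return columns


# basePath is the table root: "abc" is expected to appear in the schema.
print(partition_columns("output/model", "output/model/abc=4"))
# basePath is the partition directory itself: no partition column is expected.
print(partition_columns("output/model/abc=4", "output/model/abc=4"))
```

Under this model, moving the base path down to the partition directory leaves no name=value segment to interpret, which matches the Spark 2.3 behavior described in this report.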
In Spark 2.4
spark.read.option('basePath','output/model/abc=4').orc('output/model/abc=4')
Spark tries to infer the schema from "output/model/" instead of "output/model/abc=4", and the job fails because the schemas differ:
{code}
For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:364)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:165)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:100)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:131)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:71)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:144)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:662)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
{code}
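The assertion message in the trace above states a consistency rule: directories at the same level should have the same partition column name. A minimal sketch of such a check follows. The function check_partition_names and the sample paths in the usage note are hypothetical, and this is not the code in PartitioningUtils.resolvePartitions; it only shows why scanning from "output/model" can fail when sibling directories disagree.

```python
def check_partition_names(leaf_dirs, base_path):
    """Raise AssertionError if leaf directories disagree on partition column names.

    Hypothetical helper: a simplified model of the rule in the error message.
    """
    prefix = base_path.rstrip("/") + "/"
    seen = set()
    for leaf in leaf_dirs:
        # A leaf equal to base_path has no segments below it.
        relative = leaf[len(prefix):] if leaf.startswith(prefix) else ""
        names = tuple(
            segment.split("=", 1)[0]
            for segment in relative.split("/")
            if "=" in segment
        )
        seen.add(names)
    assert len(seen) <= 1, f"inconsistent partition column names: {sorted(seen)}"
    return next(iter(seen), ())
```

For example, check_partition_names(["output/model/abc=4", "output/model/abc=5"], "output/model") returns ("abc",), while passing ["output/model/abc=4", "output/model/xyz=1"] raises an AssertionError. If only "output/model/abc=4" is scanned, with itself as the base path, there is nothing to compare and the check passes.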