Posted to issues@spark.apache.org by "Vishal Donderia (JIRA)" <ji...@apache.org> on 2019/07/30 11:58:00 UTC

[jira] [Created] (SPARK-28563) Spark 2.4 | Reading all the data inside a partition-like directory.

Vishal Donderia created SPARK-28563:
---------------------------------------

             Summary: Spark 2.4 | Reading all the data inside a partition-like directory.
                 Key: SPARK-28563
                 URL: https://issues.apache.org/jira/browse/SPARK-28563
             Project: Spark
          Issue Type: Bug
          Components: Input/Output
    Affects Versions: 2.4.1
            Reporter: Vishal Donderia


We have upgraded our cluster from Spark 2.3 to 2.4 and are currently observing different behavior while reading data.

 

In Spark 2.3

spark.read.option('basePath', 'output/model').orc('output/model/abc=4')

Expected: the "abc" column is part of the resulting schema.


Similarly:

spark.read.option('basePath', 'output/model/abc=4').orc('output/model/abc=4')

Expected: only the data inside the partition abc=4 is read and "abc" is not part of the schema, even though "output/model" contains files with a different schema.
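
For reference, a minimal sketch of the two Spark 2.3 reads described above, assuming an existing SparkSession named spark; the paths and the partition column "abc" are the ones from this report, and the comments only restate the expected behaviour:

{code}
# Sketch of the Spark 2.3 behaviour described above.

# basePath set to the table root: partition discovery starts at
# 'output/model', so the partition column "abc" is added to the schema.
df1 = spark.read.option('basePath', 'output/model').orc('output/model/abc=4')
# df1.schema contains "abc"

# basePath set to the partition directory itself: only the files under
# abc=4 are read, and "abc" is not part of the schema.
df2 = spark.read.option('basePath', 'output/model/abc=4').orc('output/model/abc=4')
# df2.schema does not contain "abc"
{code}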


In Spark 2.4

spark.read.option('basePath', 'output/model/abc=4').orc('output/model/abc=4')

Spark tries to infer the schema from "output/model/" instead of "output/model/abc=4", and the job fails because of the mismatched schemas:

{code}
For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:

at scala.Predef$.assert(Predef.scala:170)
 at org.apache.spark.sql.execution.datasources.PartitioningUtils$.resolvePartitions(PartitioningUtils.scala:364)
 at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:165)
 at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:100)
 at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:131)
 at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:71)
 at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
 at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:144)
 at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:373)
 at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
 at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
 at org.apache.spark.sql.DataFrameReader.orc(DataFrameReader.scala:662)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
 at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
 at py4j.Gateway.invoke(Gateway.java:282)
 at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
 at py4j.commands.CallCommand.execute(CallCommand.java:79)
 at py4j.GatewayConnection.run(GatewayConnection.java:238)
 at java.lang.Thread.run(Thread.java:745)

{code} 
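
For completeness, a sketch of one way to set up the directory layout described in this report. The sibling directory name "xyz=1" and the column names are illustrative assumptions, not taken from the actual job:

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(10).selectExpr("id", "id * 2 AS value")

# Data for the partition we actually want to read.
df.write.mode('overwrite').orc('output/model/abc=4')

# A sibling directory under the same root with a different partition
# column name / schema, as the report says exists under "output/model".
df.selectExpr("id").write.mode('overwrite').orc('output/model/xyz=1')

# Per the report: in Spark 2.3 this read returned only the abc=4 data
# (without an "abc" column); in Spark 2.4 partition inference scans
# "output/model" and fails with the assertion shown above.
spark.read.option('basePath', 'output/model/abc=4') \
     .orc('output/model/abc=4') \
     .printSchema()
{code}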

 

 

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org