Posted to user@spark.apache.org by Exie <tf...@prodevelop.com.au> on 2015/07/01 03:33:16 UTC

Spark 1.4.0: Parquet partitions / folder hierarchy changed from 1.3.1

So I was delighted with Spark 1.3.1 using Parquet 1.6.0, which would
"partition" data into folders, so I set up some Parquet data partitioned by
date. This enabled us to reference a single day, month, or year, minimizing
how much data was scanned.

e.g.:
val myDataFrame = hiveContext.read.parquet("s3n://myBucket/myPath/2014/07/01")
or 
val myDataFrame = hiveContext.read.parquet("s3n://myBucket/myPath/2014/07") 

However, since upgrading to Spark 1.4.0, it doesn't seem to work the same
way.
The first line works; the "01" folder contains all the normal files:
2015-06-02 20:01         0   s3://myBucket/myPath/2014/07/01/_SUCCESS
2015-06-02 20:01      2066   s3://myBucket/myPath/2014/07/01/_common_metadata
2015-06-02 20:01   1077190   s3://myBucket/myPath/2014/07/01/_metadata
2015-06-02 19:57    119933   s3://myBucket/myPath/2014/07/01/part-r-00001.parquet
2015-06-02 19:57     48478   s3://myBucket/myPath/2014/07/01/part-r-00002.parquet
2015-06-02 19:57    576878   s3://myBucket/myPath/2014/07/01/part-r-00003.parquet

... but if I now use the second line above to read in all of the days, it
comes back empty.
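
For what it's worth, two possible workarounds (a sketch only, assuming
read.parquet resolves Hadoop-style globs in the path and accepts multiple
paths as varargs, which the 1.4 API suggests):

// Glob over the day folders instead of pointing at the month folder:
val julyData = hiveContext.read.parquet("s3n://myBucket/myPath/2014/07/*")

// Or name each day folder explicitly (read.parquet takes String* in 1.4):
val firstDays = hiveContext.read.parquet(
  "s3n://myBucket/myPath/2014/07/01",
  "s3n://myBucket/myPath/2014/07/02")

Neither is as convenient as just pointing at the month folder, though.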

Is there an option I can set somewhere to fix this?
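
For reference, I suspect 1.4's partition discovery wants Hive-style
key=value directory names, which Spark SQL then exposes as columns. A
sketch of that layout; the year/month/day column names are my own choice,
not anything Spark mandates:

// Layout: s3n://myBucket/myPath/year=2014/month=07/day=01/part-r-00001.parquet
val all = hiveContext.read.parquet("s3n://myBucket/myPath")
// year/month/day are discovered as partition columns, so a filter on them
// should prune whole directories rather than scanning them:
val july2014 = all.filter("year = 2014 AND month = 7")

With that layout, reading a single month would presumably become a filter
rather than a path.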



