Posted to user@spark.apache.org by Exie <tf...@prodevelop.com.au> on 2015/07/01 03:20:14 UTC

1.4.0

So I was delighted with Spark 1.3.1 using Parquet 1.6.0, which would
"partition" data into folders, so I set up some Parquet data partitioned by
date. This enabled us to reference a single day/month/year, minimizing how
much data was scanned.

eg:
val myDataFrame =
hiveContext.read.parquet("s3n://myBucket/myPath/2014/07/01")
or
val myDataFrame = hiveContext.read.parquet("s3n://myBucket/myPath/2014/07")
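
In case the directory layout matters: my understanding (not verified on
1.4.0, and the column names below are only illustrative) is that Spark's
automatic partition discovery expects Hive-style key=value directories,
which you get by writing with explicit partition columns, e.g.:

// Writing with partitionBy produces directories like
// s3n://myBucket/myPath/year=2014/month=7/day=1/part-r-....parquet
myDataFrame.write.partitionBy("year", "month", "day").parquet("s3n://myBucket/myPath")

// Reading the root then exposes year/month/day as columns, so a single
// month can be selected with a filter instead of a path:
val julyDf = hiveContext.read.parquet("s3n://myBucket/myPath")
  .filter("year = 2014 AND month = 7")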

However, since upgrading to Spark 1.4.0 it doesn't seem to work the same
way.
The first line still works; the "01" folder contains all the normal files:
2015-06-02 20:01         0  s3://myBucket/myPath/2014/07/01/_SUCCESS
2015-06-02 20:01      2066  s3://myBucket/myPath/2014/07/01/_common_metadata
2015-06-02 20:01   1077190  s3://myBucket/myPath/2014/07/01/_metadata
2015-06-02 19:57    119933  s3://myBucket/myPath/2014/07/01/part-r-00001.parquet
2015-06-02 19:57     48478  s3://myBucket/myPath/2014/07/01/part-r-00002.parquet
2015-06-02 19:57    576878  s3://myBucket/myPath/2014/07/01/part-r-00003.parquet

... but if I now use the second line above to read in all days, it comes
back empty.

Is there an option I can set somewhere to fix this?
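
The best workaround I can think of in the meantime (an untested sketch, the
path list is just illustrative) is to enumerate the day directories
explicitly, since read.parquet accepts multiple paths:

// Build one path per day and pass them all to read.parquet (varargs).
val dayPaths = (1 to 31).map(d => f"s3n://myBucket/myPath/2014/07/$d%02d")
val julyDf = hiveContext.read.parquet(dayPaths: _*)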



