Posted to user@spark.apache.org by Steve Loughran <st...@hortonworks.com> on 2017/07/05 11:36:07 UTC

Re: Spark querying parquet data partitioned in S3

> On 29 Jun 2017, at 17:44, fran <fr...@hivehome.com> wrote:
> 
> We have got data stored in S3 partitioned by several columns. Let's say
> following this hierarchy:
> s3://bucket/data/column1=X/column2=Y/parquet-files
> 
> We run a Spark job in an EMR cluster (1 master, 3 slaves) and realised the
> following:
> 
> A) - When we declare the initial dataframe to be the whole dataset (val df =
> sqlContext.read.parquet("s3://bucket/data/")) then the driver splits the job
> into several tasks (259) that are performed by the executors, and we believe
> the driver gets back the parquet metadata.
> 
> Question: The above takes about 25 minutes for our dataset. We believe it
> should be a lazy query (as we are not performing any actions), however it
> looks like something is happening: all the executors are reading from S3. We
> have tried mergeSchema=false and setting the schema explicitly via
> .schema(someSchema). Is there any way to speed this up?
> 
> B) - When we declare the initial dataframe to be scoped by the first column
> (val df = sqlContext.read.parquet("s3://bucket/data/column1=X")) then it seems
> that all the work (getting the parquet metadata) is done by the driver and
> there is no job submitted to Spark.
> 
> Question: Why does (A) send the work to executors but (B) does not?
> 
> The above is for EMR 5.5.0, Hadoop 2.7.3 and Spark 2.1.0.
> 
> 

Split calculation can be very slow against object stores, especially if the directory structure is deep: the tree walk used to enumerate the files is pretty inefficient there.

Then there's the schema merge, which looks at the tail of every file and so has to do a seek() against each of them. That work is parallelised around the cluster before your job actually gets scheduled.

Turning that off with spark.sql.parquet.mergeSchema=false should make that phase go away, but clearly it hasn't here.
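
For reference, a merge-free read would look something like this (an untested sketch; someSchema is whatever explicit schema you mentioned already trying):

  // turn schema merging off globally and on this read, and supply the
  // schema up front so Spark doesn't have to infer it from file footers
  sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")

  val df = sqlContext.read
    .option("mergeSchema", "false")
    .schema(someSchema)
    .parquet("s3://bucket/data/")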

A call to jstack against the driver process will show where it is stuck; that's probably the place to start.


I know that if you are using EMR you are stuck with Amazon's own S3 clients; if you were on the Apache artifacts you could move up to Hadoop 2.8 and set the spark.hadoop.fs.s3a.experimental.fadvise=random option for high-speed random access. You can also turn off job summary creation in Spark; a sketch of both settings follows.
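
Something along these lines (untested, and the property names are from memory, so check them against the docs for your Hadoop/Spark versions):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Hadoop 2.8+ s3a only: do ranged reads rather than full-file GETs,
    // which suits columnar formats like parquet
    .set("spark.hadoop.fs.s3a.experimental.fadvise", "random")
    // skip writing the parquet _metadata/_common_metadata summary files
    .set("spark.hadoop.parquet.enable.summary-metadata", "false")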


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org