Posted to user@spark.apache.org by Prithish <pr...@gmail.com> on 2016/10/27 12:19:29 UTC

Reading AVRO from S3 - No parallelism

I am trying to read a bunch of AVRO files from an S3 folder using Spark 2.0.
No matter how many executors I use or what configuration changes I make,
the cluster doesn't seem to use all the executors. I am using the
com.databricks.spark.avro library from Databricks to read the AVRO files.
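
A minimal sketch of this kind of read (the bucket and prefix below are
placeholders, not the real path):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("AvroReadExample").getOrCreate()

  // Placeholder S3 path; the real bucket/prefix will differ.
  val df = spark.read
    .format("com.databricks.spark.avro")
    .load("s3a://some-bucket/avro-folder/")

  // How many input partitions (and hence tasks) did the read produce?
  println(df.rdd.getNumPartitions)

If getNumPartitions comes back much smaller than the number of executor
cores, the job simply doesn't have enough tasks to keep every executor busy.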

However, if I try the same on CSV files (same S3 folder, same configuration
and cluster), it does use all executors.

Is there something that I need to do to enable parallelism when using the
AVRO databricks library?

Thanks for your help.

Re: Reading AVRO from S3 - No parallelism

Posted by pr...@gmail.com.
 
 
The Avro files were 500-600kb in size and that folder contained around 1200
files. The total folder size was around 600mb. Will try repartition. Thank you.

On Oct 28, 2016 at 2:24 AM, Michael Armbrust <michael@databricks.com> wrote:

> How big are your avro files? We collapse many small files into a single
> partition to eliminate scheduler overhead. If you need explicit parallelism
> you can also repartition.

Re: Reading AVRO from S3 - No parallelism

Posted by Michael Armbrust <mi...@databricks.com>.
How big are your avro files?  We collapse many small files into a single
partition to eliminate scheduler overhead.  If you need explicit
parallelism you can also repartition.
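
A rough sketch of both approaches, assuming spark-avro 3.x goes through
Spark's standard file reader (the path, the partition count, and the 32 MB
value below are all illustrative; spark.sql.files.maxPartitionBytes is, as
far as I can tell, the Spark 2.0 setting that caps how much data gets packed
into one input partition, with a default of 128 MB):

  // `spark` is the usual SparkSession (e.g. from spark-shell).

  // Option 1: repartition right after the read so the work spreads across
  // all executors (200 is only an example count).
  val repartitioned = spark.read
    .format("com.databricks.spark.avro")
    .load("s3a://some-bucket/avro-folder/")
    .repartition(200)

  // Option 2: lower the packing target before reading, so fewer small files
  // end up in each input partition (32 MB here is only an example value).
  spark.conf.set("spark.sql.files.maxPartitionBytes", "33554432")
  val moreInputPartitions = spark.read
    .format("com.databricks.spark.avro")
    .load("s3a://some-bucket/avro-folder/")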

On Thu, Oct 27, 2016 at 5:19 AM, Prithish <pr...@gmail.com> wrote:

> I am trying to read a bunch of AVRO files from an S3 folder using Spark
> 2.0. No matter how many executors I use or what configuration changes I
> make, the cluster doesn't seem to use all the executors. I am using the
> com.databricks.spark.avro library from Databricks to read the AVRO files.
>
> However, if I try the same on CSV files (same S3 folder, same
> configuration and cluster), it does use all executors.
>
> Is there something that I need to do to enable parallelism when using the
> AVRO databricks library?
>
> Thanks for your help.
>
>
>