Posted to user@spark.apache.org by Prithish <pr...@gmail.com> on 2016/10/27 12:19:29 UTC
Reading AVRO from S3 - No parallelism
I am trying to read a bunch of Avro files from an S3 folder using Spark 2.0.
No matter how many executors I use or what configuration changes I make,
the cluster doesn't seem to use all of the executors. I am using the
com.databricks.spark.avro library from Databricks to read the Avro files.
However, if I try the same on CSV files (same S3 folder, same configuration
and cluster), it does use all executors.
Is there something I need to do to enable parallelism when using the
Databricks Avro library?
Thanks for your help.
Re: Reading AVRO from S3 - No parallelism
Posted by pr...@gmail.com.
The Avro files were 500-600 KB in size, and the folder contained around 1200 files, around 600 MB in total. Will try repartition. Thank you.
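The numbers above line up with Spark's small-file packing. A rough sketch of the packing arithmetic, assuming Spark 2.0's default settings of spark.sql.files.maxPartitionBytes (128 MB) and spark.sql.files.openCostInBytes (4 MB); this is an illustrative re-implementation of the idea, not Spark's actual code:

```python
def pack_files(file_sizes, max_partition_bytes=128 << 20, open_cost=4 << 20):
    """Greedily pack files into partitions the way Spark's file scan
    strategy does: each file is charged its size plus a fixed "open cost",
    files are taken largest-first, and a new partition starts once the
    running total would exceed the per-partition byte limit."""
    partitions, current, current_bytes = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost
        if current and current_bytes + cost > max_partition_bytes:
            partitions.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += cost
    if current:
        partitions.append(current)
    return partitions

# 1200 Avro files of ~500 KB each, as in this thread:
parts = pack_files([500 * 1024] * 1200)
print(len(parts))  # far fewer partitions than files
```

With these defaults, each 500 KB file is charged roughly 4.5 MB, so a 128 MB partition absorbs about 28 files and the 1200 files collapse into a few dozen input partitions, which is why the job never keeps a large executor fleet busy.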
Re: Reading AVRO from S3 - No parallelism
Posted by Michael Armbrust <mi...@databricks.com>.
How big are your Avro files? We collapse many small files into a single
partition to eliminate scheduler overhead. If you need explicit
parallelism, you can also repartition.
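A minimal PySpark sketch of the suggested fix, assuming the com.databricks:spark-avro package is on the classpath and a running Spark 2.0 session; the bucket path and partition count below are illustrative, not taken from the thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("avro-repartition").getOrCreate()

# Read the small Avro files; Spark packs them into few input partitions.
df = (spark.read
      .format("com.databricks.spark.avro")
      .load("s3a://my-bucket/avro-folder/"))  # hypothetical path

print(df.rdd.getNumPartitions())  # likely far fewer than the file count

# Explicitly repartition to spread work across all executor cores,
# e.g. (number of executors) x (cores per executor).
df = df.repartition(64)
```

Repartitioning forces a full shuffle but gives exact control over parallelism; alternatively, lowering spark.sql.files.maxPartitionBytes or spark.sql.files.openCostInBytes yields more input partitions at read time without a shuffle.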