Posted to user@spark.apache.org by "shea.parkes" <sh...@gmail.com> on 2016/10/19 02:04:22 UTC

How does Spark determine in-memory partition count when reading Parquet files?

When reading a Parquet file with >50 parts, Spark is giving me a DataFrame
object with far fewer in-memory partitions.

I'm happy to troubleshoot this further, but I don't know Scala well and
could use some help pointing me in the right direction.  Where should I be
looking in the code base to understand how many partitions will result from
reading a Parquet file?

Thanks,

Shea



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-determine-in-memory-partition-count-when-reading-Parquet-files-tp27921.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: How does Spark determine in-memory partition count when reading Parquet files?

Posted by Michael Armbrust <mi...@databricks.com>.
In Spark 2.0 we bin-pack small files into a single task to avoid
overloading the scheduler.  If you want a specific number of partitions, you
should repartition.  If you want to disable this optimization, you can set
the file open cost very high: spark.sql.files.openCostInBytes
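
In PySpark, a rough sketch of both options (the path, partition count, and
open-cost value below are placeholders, not values from this thread):

    # Read the Parquet data and see how many in-memory partitions Spark created.
    df = spark.read.parquet("/data/events")          # hypothetical path
    print(df.rdd.getNumPartitions())

    # Option 1: ask for an explicit number of partitions after the read.
    df50 = df.repartition(50)

    # Option 2: make each file look expensive to open, so fewer small files
    # are bin-packed into a single task (value is deliberately oversized).
    spark.conf.set("spark.sql.files.openCostInBytes", str(512 * 1024 * 1024))
    df = spark.read.parquet("/data/events")
    print(df.rdd.getNumPartitions())                 # typically higher now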

On Tue, Oct 18, 2016 at 7:04 PM, shea.parkes <sh...@gmail.com> wrote:

> When reading a Parquet file with >50 parts, Spark is giving me a DataFrame
> object with far fewer in-memory partitions.
>
> I'm happy to troubleshoot this further, but I don't know Scala well and
> could use some help pointing me in the right direction.  Where should I be
> looking in the code base to understand how many partitions will result from
> reading a Parquet file?
>
> Thanks,
>
> Shea
>

Re: How does Spark determine in-memory partition count when reading Parquet files?

Posted by "shea.parkes" <sh...@gmail.com>.
Thank you for the reply, saurabh85.  We do tune and adjust our shuffle
partition count, but that was not influencing the reading of Parquet files
(as I understand it, the data is not shuffled as it is read).
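
A quick PySpark sketch of what I mean (the path, column, and shuffle-partition
value are placeholders): spark.sql.shuffle.partitions only takes effect once a
shuffle happens, so it does not change the partition count of the initial read.

    spark.conf.set("spark.sql.shuffle.partitions", "200")

    df = spark.read.parquet("/data/events")          # hypothetical path
    print(df.rdd.getNumPartitions())    # driven by file sizes / bin-packing, not 200

    agg = df.groupBy("some_column").count()          # hypothetical column
    print(agg.rdd.getNumPartitions())   # 200, because the groupBy shuffles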

My apologies: I did actually receive an answer, but it was not captured on the
mailing list here.  I'm posting the exchange below so future readers can find
the answer as well:



On Wed, Oct 19, 2016 at 9:33 PM Michael Armbrust <xx...@databricks.com> wrote:
In Spark 2.0 we bin-pack small files into a single task to avoid overloading
the scheduler.  If you want a specific number of partitions, you should
repartition.  If you want to disable this optimization, you can set the file
open cost very high: spark.sql.files.openCostInBytes

My reply:

Thank you very much for that information, sir.  It does make sense; I just
did not find it in any release notes.  I will work to tune that parameter
appropriately for our workflow.
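
For anyone landing here later, one way to set it at session build time,
sketched in PySpark (the app name and value are illustrative and should be
tuned against your own file sizes):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("parquet-read-tuning")   # placeholder name
             .config("spark.sql.files.openCostInBytes", str(8 * 1024 * 1024))
             .getOrCreate())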



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-does-Spark-determine-in-memory-partition-count-when-reading-Parquet-files-tp27921p27943.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org