Posted to user@spark.apache.org by Ajay Srivastava <a_...@yahoo.com.INVALID> on 2016/07/07 06:53:08 UTC

SPARK-8813 - combining small files in spark sql

Hi,
This JIRA https://issues.apache.org/jira/browse/SPARK-8813 is fixed in Spark 2.0, but the resolution is not mentioned there.
In our use case, there are big as well as many small Parquet files being queried using Spark SQL. Can someone please explain what the fix is and how I can use it in Spark 2.0? I searched the commits in the 2.0 branch and it looks like I need to use spark.sql.files.openCostInBytes, but I am not sure.


Regards,
Ajay

Re: SPARK-8813 - combining small files in spark sql

Posted by Reynold Xin <rx...@databricks.com>.
When using native data sources (e.g. Parquet, ORC, JSON, ...), small files
are automatically packed together so that each partition adds up to a
target size, configurable via spark.sql.files.maxPartitionBytes.
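
For example, a minimal sketch of applying this when building a session
(the config key is real; the app name, value, and path are only
illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SmallFilesExample")  // hypothetical app name
      // Target bytes per input partition; small files are packed together
      // until a partition reaches roughly this size (128 MB is the default).
      .config("spark.sql.files.maxPartitionBytes", 128L * 1024 * 1024)
      .getOrCreate()

    // A directory of many small Parquet files (hypothetical path) now maps
    // to far fewer partitions than files.
    val df = spark.read.parquet("/data/small-files")
    println(df.rdd.getNumPartitions)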

spark.sql.files.openCostInBytes is used to specify the cost of each "file".
That is, an empty file will be considered to have at
least spark.sql.files.openCostInBytes bytes.
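
To give a sense of how the two settings interact, here is a rough sketch
of the split-size heuristic as I read the 2.0 file-scan planning code
(names paraphrased, not the exact implementation):

    // Each file is charged its length plus the fixed open cost, so even an
    // empty file "weighs" openCostInBytes when partitions are packed.
    def maxSplitBytes(maxPartitionBytes: Long,
                      openCostInBytes: Long,
                      fileSizes: Seq[Long],
                      defaultParallelism: Int): Long = {
      val totalBytes = fileSizes.map(_ + openCostInBytes).sum
      val bytesPerCore = totalBytes / defaultParallelism
      math.min(maxPartitionBytes, math.max(openCostInBytes, bytesPerCore))
    }

Files (or splits of large files) are then greedily bin-packed into
partitions of up to that many bytes, which is why many tiny Parquet files
end up combined into a handful of tasks.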

On Wed, Jul 6, 2016 at 11:53 PM, Ajay Srivastava <
a_k_srivastava@yahoo.com.invalid> wrote:

> Hi,
>
> This JIRA https://issues.apache.org/jira/browse/SPARK-8813 is fixed in
> Spark 2.0, but the resolution is not mentioned there.
>
> In our use case, there are big as well as many small Parquet files being
> queried using Spark SQL.
> Can someone please explain what the fix is and how I can use it in Spark
> 2.0? I searched the commits in the 2.0 branch and it looks like I need
> to use spark.sql.files.openCostInBytes, but I am not sure.
>
>
> Regards,
> Ajay
>

Re: SPARK-8813 - combining small files in spark sql

Posted by Sean Owen <so...@cloudera.com>.
-user

Reynold made the comment that he thinks this was resolved by another
change; maybe he can comment.

On Thu, Jul 7, 2016 at 7:53 AM, Ajay Srivastava
<a_...@yahoo.com.invalid> wrote:
> Hi,
>
> This JIRA https://issues.apache.org/jira/browse/SPARK-8813 is fixed in
> Spark 2.0, but the resolution is not mentioned there.
>
> In our use case, there are big as well as many small Parquet files being
> queried using Spark SQL.
> Can someone please explain what the fix is and how I can use it in Spark
> 2.0? I searched the commits in the 2.0 branch and it looks like I need
> to use spark.sql.files.openCostInBytes, but I am not sure.
>
>
> Regards,
> Ajay

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org