Posted to user@spark.apache.org by Maurin Lenglart <ma...@cuberonlabs.com> on 2016/07/26 19:02:45 UTC

dynamic coalesce to pick file size

Hi,
I am running a SQL query that returns a DataFrame, then writing the result with “df.write”. The output ends up in many small files (about 100 files of roughly 200 KB each), so for now I call “.coalesce(2)” before the write.
But the “2” I picked is static. Is there a way to pick the number dynamically, based on the desired file size? (Around 256 MB would be perfect.)

I am running Spark 1.6 on CDH with YARN, and the files are written in Parquet format.

Thanks


Re: dynamic coalesce to pick file size

Posted by Pedro Rodriguez <sk...@gmail.com>.
I asked something similar; search for "Tools for Balancing Partitions
By Size" (I couldn't find the link in the archives). Unfortunately, there
doesn't seem to be a good solution right now other than knowing your job
statistics. I am planning to implement the idea I explained in the last
paragraph or so of the last email I sent, in this library:
https://github.com/EntilZha/spark-s3
It could be a while before I make my way up to DataFrames, though (RDDs
for now).
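
As a rough sketch of what "knowing your job statistics" could look like in
practice: if you have an estimate of the total output size in bytes (for
example, measured from a previous run's output directory), the partition
count follows from dividing by the target file size. The function name
pick_num_partitions and the variable estimated_bytes are hypothetical, and
how you obtain the estimate is left open; this is not a built-in Spark
feature.

```python
# Target size per output file: the ~256 MB asked for in this thread.
TARGET_FILE_BYTES = 256 * 1024 * 1024


def pick_num_partitions(estimated_output_bytes, target_file_bytes=TARGET_FILE_BYTES):
    """Return a partition count that yields files of roughly target_file_bytes.

    estimated_output_bytes is an assumption supplied by the caller, e.g. the
    on-disk size of a previous run's output. It is not computed here.
    """
    if target_file_bytes <= 0:
        raise ValueError("target_file_bytes must be positive")
    if estimated_output_bytes < 0:
        raise ValueError("estimated_output_bytes must not be negative")
    # Integer ceiling division, so a partial file still gets its own partition.
    needed = (estimated_output_bytes + target_file_bytes - 1) // target_file_bytes
    # Always write at least one file, even for an empty or tiny result.
    return max(1, needed)


# Hypothetical use with a DataFrame `df` (not executed here):
#   n = pick_num_partitions(estimated_bytes)
#   df.coalesce(n).write.parquet(output_path)
```

For the numbers in the original question (about 100 files of 200 KB, so
roughly 20 MB in total), this gives a single output file. Note that
coalesce can only reduce the number of partitions, so the result only
matters when it is smaller than the DataFrame's current partition count.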




-- 
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni
