Posted to user@spark.apache.org by Nikhil Goyal <no...@gmail.com> on 2023/04/20 20:29:05 UTC

Partition by on dataframe causing a Sort

Hi folks,

We are writing a dataframe and doing a partitionBy() on it:
df.write.partitionBy('col').parquet('output')

The job is running very slowly because, within each task, Spark sorts
the rows by the partition column before it starts writing to the final
location. This sort isn't useful in our case since the number of output
files stays the same either way. I was wondering if Spark could instead
open one file pointer per partition value, keep appending rows as they
arrive, and close all the pointers when the task is done. This would
reduce the memory footprint and speed up the job by eliminating the
sort. We could implement a custom source, but I don't see how we can
really control this behavior in the sink. If anyone has any suggestions
please let me know.
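
To make the idea concrete, here is a rough, untested sketch in Scala of
what I mean (df is the dataframe from the snippet above; it writes
delimited text to local files purely for illustration, and the 'col'
column and 'output' path are the same placeholders as before; a real
version would have to write Parquet through the Hadoop FileSystem API):

import java.io.{BufferedWriter, File, FileWriter}
import scala.collection.mutable
import org.apache.spark.TaskContext

// One open writer per partition value: append rows as they arrive,
// close everything when the task is done. No sort anywhere.
df.rdd.foreachPartition { rows =>
  val writers = mutable.Map.empty[String, BufferedWriter]
  try {
    rows.foreach { row =>
      val key = row.getAs[String]("col")
      val w = writers.getOrElseUpdate(key, {
        val dir = new File(s"output/col=$key")
        dir.mkdirs()
        val file = new File(dir, s"part-${TaskContext.getPartitionId()}")
        new BufferedWriter(new FileWriter(file, true))
      })
      w.write(row.mkString(","))
      w.newLine()
    }
  } finally {
    writers.values.foreach(_.close())
  }
}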

Thanks
Nikhil

Re: Partition by on dataframe causing a Sort

Posted by Nikhil Goyal <no...@gmail.com>.
Is it possible to use MultipleOutputs with a custom OutputFormat, and
then use `saveAsHadoopFile` to achieve this?
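
Roughly what I have in mind (an untested sketch using
MultipleTextOutputFormat from the old mapred API; KeyBasedOutputFormat
is just a name I made up, and 'col'/'output' are the same placeholders
as before):

import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record to a directory named after its key, mimicking
// partitionBy-style output without the per-task sort.
class KeyBasedOutputFormat extends MultipleTextOutputFormat[String, String] {
  // "name" is the default part-NNNNN file name assigned to the task
  override def generateFileNameForKeyValue(key: String, value: String,
      name: String): String =
    s"col=$key/$name"

  // Return null so only the value is written, not "key<TAB>value"
  override def generateActualKey(key: String, value: String): String = null
}

df.rdd
  .map(row => (row.getAs[String]("col"), row.mkString(",")))
  .saveAsHadoopFile("output", classOf[String], classOf[String],
    classOf[KeyBasedOutputFormat])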
