Posted to user@spark.apache.org by Tzahi File <tz...@ironsrc.com> on 2020/08/31 14:17:29 UTC

Merging Parquet Files

Hi,

I would like to develop a process that merges Parquet files.
My first intention was to develop it with PySpark, using coalesce(1) to
create only one file.
This process is going to run on a huge number of files.
I wanted your advice on the best way to implement it (PySpark isn't
a must).


Thanks,
Tzahi
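
For reference, a minimal PySpark sketch of the coalesce(1) approach
described above; the bucket paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-parquet").getOrCreate()

    # Read every small Parquet file under the input prefix.
    df = spark.read.parquet("s3://bucket/input/")  # placeholder path

    # coalesce(1) collapses the data into a single partition, so the
    # write below produces exactly one output file.
    df.coalesce(1).write.mode("overwrite").parquet("s3://bucket/merged/")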

Re: Merging Parquet Files

Posted by Michael Segel <ms...@hotmail.com>.
Hi, 

I think you’re asking the right question; however, you’re assuming he’s on the cloud, and he never mentioned the size of the files.

It could be that he’s got a lot of small-ish data sets. 1 GB is kinda small in relative terms.

Again YMMV. 

Personally, if you’re going to use Spark for data engineering: Scala first, Java second, then Python, unless you’re a Python developer, in which case go with Python.

I agree that wanting to have a single file needs to be explained. 




Re: Merging Parquet Files

Posted by Tzahi File <tz...@ironsrc.com>.
You are right.

In general, this job should deal with very small files and create an output
file of less than 100 MB.
In other cases I would need to create multiple files of around 100 MB.
The issue with decreasing the number of partitions is that it will
reduce the ETL's performance, while this job should only be a side job.
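
A possible way to size the output as described: estimate the total input
size through the Hadoop FileSystem API and derive the number of output
files so each lands near 100 MB. This is only a sketch; spark._jvm and
spark._jsc are internal handles, the paths are placeholders, and the
compressed output size will only roughly track the input size:

    import math
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("merge-small-files").getOrCreate()
    df = spark.read.parquet("s3://bucket/small-files/")  # placeholder path

    # Ask the Hadoop FileSystem for the total size under the input prefix.
    jpath = spark._jvm.org.apache.hadoop.fs.Path("s3://bucket/small-files/")
    fs = jpath.getFileSystem(spark._jsc.hadoopConfiguration())
    total_bytes = fs.getContentSummary(jpath).getLength()

    # Aim for ~100 MB per output file; always write at least one file.
    num_files = max(1, math.ceil(total_bytes / (100 * 1024 * 1024)))

    df.repartition(num_files).write.mode("overwrite").parquet("s3://bucket/merged/")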






Re: Merging Parquet Files

Posted by Jörn Franke <jo...@gmail.com>.
Why only one file?
I would go more for files of a specific size, e.g., data split into 1 GB files. Another reason is that if you need to transfer the data (e.g., to other clouds), having a single file of several terabytes is bad.

It depends on your use case, but you might also look at partitioning etc.
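
If bounded file sizes are the goal, one option in this direction is the
maxRecordsPerFile write option, sketched below; the partition count and
row cap are illustrative and would have to be derived from the average
row size of the actual data:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bounded-files").getOrCreate()
    df = spark.read.parquet("s3://bucket/input/")  # placeholder path

    # Cap each output file at a row count chosen so that
    # rows_per_file * average_row_size lands near the 1 GB target.
    (df.repartition(32)                       # illustrative partition count
       .write
       .option("maxRecordsPerFile", 5000000)  # illustrative row cap
       .mode("overwrite")
       .parquet("s3://bucket/output/"))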
