Posted to user@spark.apache.org by Yann Moisan <ya...@gmail.com> on 2018/11/13 20:28:20 UTC

[Spark SQL] Does Spark group small files

Hello,

I'm using Spark 2.3.1.

I have a job that reads 5,000 small Parquet files from S3.

When I do a mapPartitions followed by a collect, only *278* tasks are used
(I would have expected 5,000). Does Spark group small files? If so, what
is the threshold for grouping? Is it configurable? Is there a link to the
corresponding source code?

Rgds,

Yann.

Re: [Spark SQL] Does Spark group small files

Posted by Silvio Fiorito <si...@granturing.com>.
Yes, it does bin-packing for small files, which is a good thing: you avoid having many small partitions, especially if you’re writing this data back out (in effect, it’s compacting as you read). The default maximum partition size is 128MB (spark.sql.files.maxPartitionBytes), with a 4MB “cost” for opening each file (spark.sql.files.openCostInBytes). You can configure this using the settings described here: http://spark.apache.org/docs/latest/sql-performance-tuning.html#other-configuration-options
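
To make the bin-packing above concrete, here is a simplified Python model of how Spark 2.3 seems to assign small files to read partitions. It is a sketch based on a reading of the non-bucketed read path in FileSourceScanExec, not Spark's actual code. The function name plan_partitions, its parameters, and the input (5,000 files of 3 MiB on 8 cores) are hypothetical. It assumes every file is smaller than the split size and ignores bucketing.

```python
MIB = 1024 * 1024

def plan_partitions(file_sizes, default_parallelism,
                    max_partition_bytes=128 * MIB,
                    open_cost_bytes=4 * MIB):
    """Return a list of partitions, each a list of file sizes in bytes.

    Simplified model only; names and structure are illustrative.
    """
    # Every file is charged its length plus a fixed "open cost".
    total_bytes = sum(size + open_cost_bytes for size in file_sizes)
    bytes_per_core = total_bytes // default_parallelism
    # The target split size shrinks when there is little data per core.
    max_split_bytes = min(max_partition_bytes,
                          max(open_cost_bytes, bytes_per_core))

    partitions, current, current_size = [], [], 0
    # Largest files first, then fill each partition until the next file
    # would overflow it ("next fit decreasing").
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > max_split_bytes:
            partitions.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size + open_cost_bytes
    if current:
        partitions.append(current)
    return partitions

# Hypothetical input: 5,000 files of 3 MiB each, default parallelism of 8.
parts = plan_partitions([3 * MIB] * 5000, default_parallelism=8)
print(len(parts), "partitions,", len(parts[0]), "files in the first one")
# prints: 278 partitions, 18 files in the first one
```

With these made-up sizes the model packs 18 files into each partition. The real file sizes in the question are not known, so the matching partition count should be read as an illustration of the mechanism rather than as a diagnosis.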

RE: [Spark SQL] Does Spark group small files

Posted by "Lienhart, Pierre (DI IZ) - AF (ext)" <pi...@airfrance.fr>.
Hello Yann,

From my understanding, when reading small files Spark groups them and loads the content of each group into the same partition, so you won’t end up with one partition per file and a huge number of very small partitions. For file-based Spark SQL sources such as Parquet, this behavior is controlled by the spark.sql.files.maxPartitionBytes parameter, which is set to 128 MiB by default. For example, if you have only 8 MiB files on your file system, you will end up with partitions holding the content of up to 16 files (fewer in practice, because each file is also charged an open cost, spark.sql.files.openCostInBytes). If your files are heavily compressed, this can result in pretty fat partitions once the data is decompressed in memory: roughly spark.sql.files.maxPartitionBytes multiplied by the compression ratio (uncompressed size over compressed size).
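
The 16-files figure can be checked with a little arithmetic. The sketch below counts how many equally sized files fit in one partition, first with no open cost and then with the 4 MB per-file open cost mentioned earlier in the thread. The helper files_per_partition is hypothetical, and the packing rule it uses (add a file while the running total plus the file's length still fits, then charge length plus open cost) is an assumption about how Spark 2.3 behaves.

```python
MIB = 1024 * 1024

def files_per_partition(file_size, max_partition_bytes=128 * MIB,
                        open_cost_bytes=0):
    """Count equally sized files packed into a single partition.

    Illustrative helper: a file is added while the running total plus the
    file's length still fits; each added file then charges its length plus
    open_cost_bytes against the partition.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    count, used = 0, 0
    # The first file is always accepted, even if it is larger than the limit.
    while count == 0 or used + file_size <= max_partition_bytes:
        count += 1
        used += file_size + open_cost_bytes
    return count

print(files_per_partition(8 * MIB))                            # 16
print(files_per_partition(8 * MIB, open_cost_bytes=4 * MIB))   # 11
```

Under these assumptions, 8 MiB files are packed 16 to a partition when the open cost is ignored and 11 to a partition when it is counted.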

I can’t give you a link to a specific source code snippet but this is my experience from working with a lot of small parquet files.

Regards,

Pierre
