Posted to user@spark.apache.org by Daniel Siegmann <ds...@securityscorecard.io> on 2017/05/16 20:12:54 UTC

Documentation on "Automatic file coalescing for native data sources"?

When using spark.read on a large number of small files, these are
automatically coalesced into fewer partitions. The only documentation I can
find on this is in the Spark 2.0.0 release notes, where it simply says (
http://spark.apache.org/releases/spark-release-2-0-0.html):

"Automatic file coalescing for native data sources"

Can anyone point me to documentation explaining what triggers this feature,
how it decides how many partitions to coalesce to, and what counts as a
"native data source"? I couldn't find any mention of this feature in the
SQL Programming Guide and Google was not helpful.

--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001

Re: Documentation on "Automatic file coalescing for native data sources"?

Posted by Daniel Siegmann <ds...@securityscorecard.io>.
Thanks for the help everyone.

It seems the automatic coalescing doesn't happen when accessing ORC data
through a Hive metastore unless you configure
spark.sql.hive.convertMetastoreOrc to be true (it is false by default). I'm
not sure if this is documented somewhere, or if there's any reason not to
enable it, but I haven't had any problem with it.
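For anyone who lands on this thread later, a minimal sketch of enabling both settings discussed here at launch time (the config keys are the ones named in this thread; the 134217728 value is just the 128MB default spelled out to make the knob explicit — this is a config fragment, not an endorsement of particular values):

```
spark-shell \
  --conf spark.sql.hive.convertMetastoreOrc=true \
  --conf spark.sql.files.maxPartitionBytes=134217728
```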


--
Daniel Siegmann
Senior Software Engineer
*SecurityScorecard Inc.*
214 W 29th Street, 5th Floor
New York, NY 10001


On Sat, May 20, 2017 at 9:14 PM, Kabeer Ahmed <ka...@gmx.co.uk> wrote:

> Thank you Takeshi.
>
> As far as I can see from the code pointed to, the default number of bytes to pack
> into a partition is set to 128MB, the same as the default Parquet block size.
>
> Daniel,
>
> It seems you do have a need to modify the number of bytes you want to pack
> per partition. I am curious to know the scenario. Please share if you can.
>
> Thanks,
> Kabeer.
>
> On May 20 2017, at 4:54 pm, Takeshi Yamamuro <li...@gmail.com>
> wrote:
>
>> I think this document points to the logic here:
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418
>>
>> This logic merges small files into partitions, and you can control the
>> threshold via `spark.sql.files.maxPartitionBytes`.
>>
>> // maropu
>>
>>
>> On Sat, May 20, 2017 at 8:15 AM, ayan guha <gu...@gmail.com> wrote:
>>
>> I think, like all other read operations, it is driven by the input format
>> used, and I think some variation of CombineFileInputFormat is used by
>> default. I think you can test it by forcing a particular input format which
>> gets one file per split; then you should end up with the same number of
>> partitions as your data files.
>>
>> On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aa...@gmail.com>
>> wrote:
>>
>> Hey all,
>>
>> A reply on this would be great!
>>
>> Thanks,
>> A.B.
>>
>> On 17-May-2017 1:43 AM, "Daniel Siegmann" <ds...@securityscorecard.io>
>> wrote:
>>
>> When using spark.read on a large number of small files, these are
>> automatically coalesced into fewer partitions. The only documentation I can
>> find on this is in the Spark 2.0.0 release notes, where it simply says (
>> http://spark.apache.org/releases/spark-release-2-0-0.html
>> ):
>>
>> "Automatic file coalescing for native data sources"
>>
>> Can anyone point me to documentation explaining what triggers this
>> feature, how it decides how many partitions to coalesce to, and what counts
>> as a "native data source"? I couldn't find any mention of this feature in
>> the SQL Programming Guide and Google was not helpful.
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>>
>>
>>
>> --
>> ---
>> Takeshi Yamamuro
>>
>

Re: Documentation on "Automatic file coalescing for native data sources"?

Posted by Takeshi Yamamuro <li...@gmail.com>.
I think this document points to the logic here:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L418

This logic merges small files into partitions, and you can control the
threshold via `spark.sql.files.maxPartitionBytes`.
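Based on the code linked above, here is a rough, self-contained Python sketch of how that logic packs small files into partitions. This is a simplified model, not Spark itself: splitting of large files is omitted, the greedy packing is an approximation of what the scan exec does, and only the two documented defaults are assumed (128MB for `spark.sql.files.maxPartitionBytes`, 4MB for `spark.sql.files.openCostInBytes`):

```python
# Simplified model of Spark 2.x's file coalescing for native data sources.
# Each file is charged its size plus a per-file "open cost", and files are
# packed largest-first into partitions of at most maxSplitBytes.

def max_split_bytes(file_sizes, default_parallelism,
                    max_partition_bytes=128 * 1024 * 1024,  # spark.sql.files.maxPartitionBytes
                    open_cost_in_bytes=4 * 1024 * 1024):    # spark.sql.files.openCostInBytes
    """Target partition size: capped at maxPartitionBytes, but shrunk toward
    total-bytes-per-core so small inputs still use all cores."""
    total = sum(file_sizes) + open_cost_in_bytes * len(file_sizes)
    bytes_per_core = total / default_parallelism
    return int(min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core)))

def pack_files(file_sizes, default_parallelism,
               max_partition_bytes=128 * 1024 * 1024,
               open_cost_in_bytes=4 * 1024 * 1024):
    """Greedily pack files (largest first) into partitions whose combined
    cost (size + open cost per file) stays within max_split_bytes."""
    limit = max_split_bytes(file_sizes, default_parallelism,
                            max_partition_bytes, open_cost_in_bytes)
    partitions, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        cost = size + open_cost_in_bytes
        if current and current_size + cost > limit:
            partitions.append(current)       # close the full partition
            current, current_size = [], 0
        current.append(size)
        current_size += cost
    if current:
        partitions.append(current)
    return partitions

# Example: 1000 files of 1MB each on 8 cores. Each file "costs" 5MB
# (1MB data + 4MB open cost), so 25 files fit under the 128MB cap,
# giving 40 partitions instead of 1000.
parts = pack_files([1024 * 1024] * 1000, default_parallelism=8)
print(len(parts))  # 40
```

In this model, raising `spark.sql.files.openCostInBytes` makes each small file look more expensive and so yields fewer, fatter partitions, which matches the intuition in the thread about why many tiny files still coalesce aggressively.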

// maropu


On Sat, May 20, 2017 at 8:15 AM, ayan guha <gu...@gmail.com> wrote:

> I think, like all other read operations, it is driven by the input format used,
> and I think some variation of CombineFileInputFormat is used by default.
> I think you can test it by forcing a particular input format which gets one
> file per split; then you should end up with the same number of partitions as
> your data files.
>
> On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aa...@gmail.com>
> wrote:
>
>> Hey all,
>>
>> A reply on this would be great!
>>
>> Thanks,
>> A.B.
>>
>> On 17-May-2017 1:43 AM, "Daniel Siegmann" <ds...@securityscorecard.io>
>> wrote:
>>
>>> When using spark.read on a large number of small files, these are
>>> automatically coalesced into fewer partitions. The only documentation I can
>>> find on this is in the Spark 2.0.0 release notes, where it simply says (
>>> http://spark.apache.org/releases/spark-release-2-0-0.html):
>>>
>>> "Automatic file coalescing for native data sources"
>>>
>>> Can anyone point me to documentation explaining what triggers this
>>> feature, how it decides how many partitions to coalesce to, and what counts
>>> as a "native data source"? I couldn't find any mention of this feature in
>>> the SQL Programming Guide and Google was not helpful.
>>>
>>> --
>>> Daniel Siegmann
>>> Senior Software Engineer
>>> *SecurityScorecard Inc.*
>>> 214 W 29th Street, 5th Floor
>>> New York, NY 10001
>>>
>>> --
> Best Regards,
> Ayan Guha
>



-- 
---
Takeshi Yamamuro

Re: Documentation on "Automatic file coalescing for native data sources"?

Posted by ayan guha <gu...@gmail.com>.
I think, like all other read operations, it is driven by the input format used,
and I think some variation of CombineFileInputFormat is used by default.
I think you can test it by forcing a particular input format which gets one
file per split; then you should end up with the same number of partitions as
your data files.

On Sat, 20 May 2017 at 5:12 am, Aakash Basu <aa...@gmail.com>
wrote:

> Hey all,
>
> A reply on this would be great!
>
> Thanks,
> A.B.
>
> On 17-May-2017 1:43 AM, "Daniel Siegmann" <ds...@securityscorecard.io>
> wrote:
>
>> When using spark.read on a large number of small files, these are
>> automatically coalesced into fewer partitions. The only documentation I can
>> find on this is in the Spark 2.0.0 release notes, where it simply says (
>> http://spark.apache.org/releases/spark-release-2-0-0.html):
>>
>> "Automatic file coalescing for native data sources"
>>
>> Can anyone point me to documentation explaining what triggers this
>> feature, how it decides how many partitions to coalesce to, and what counts
>> as a "native data source"? I couldn't find any mention of this feature in
>> the SQL Programming Guide and Google was not helpful.
>>
>> --
>> Daniel Siegmann
>> Senior Software Engineer
>> *SecurityScorecard Inc.*
>> 214 W 29th Street, 5th Floor
>> New York, NY 10001
>>
>> --
Best Regards,
Ayan Guha

Re: Documentation on "Automatic file coalescing for native data sources"?

Posted by Aakash Basu <aa...@gmail.com>.
Hey all,

A reply on this would be great!

Thanks,
A.B.

On 17-May-2017 1:43 AM, "Daniel Siegmann" <ds...@securityscorecard.io>
wrote:

> When using spark.read on a large number of small files, these are
> automatically coalesced into fewer partitions. The only documentation I can
> find on this is in the Spark 2.0.0 release notes, where it simply says (
> http://spark.apache.org/releases/spark-release-2-0-0.html):
>
> "Automatic file coalescing for native data sources"
>
> Can anyone point me to documentation explaining what triggers this
> feature, how it decides how many partitions to coalesce to, and what counts
> as a "native data source"? I couldn't find any mention of this feature in
> the SQL Programming Guide and Google was not helpful.
>
> --
> Daniel Siegmann
> Senior Software Engineer
> *SecurityScorecard Inc.*
> 214 W 29th Street, 5th Floor
> New York, NY 10001
>
>