Posted to dev@spark.apache.org by Koert Kuipers <ko...@tresata.com> on 2018/08/06 01:06:54 UTC

Re: [DISCUSS][SQL] Control the number of output files

lukas,
what is the jira ticket for this? i would like to follow its activity.
thanks!
koert

On Wed, Jul 25, 2018 at 5:32 PM, lukas nalezenec <lu...@apache.org> wrote:

> Hi,
> Yes, this feature is planned: Spark should soon be able to repartition
> output by size.
> Lukas
>
>
> On Wed, Jul 25, 2018 at 23:26, Forest Fang <fo...@outlook.com>
> wrote:
>
>> Has there been any discussion to simply support Hive's merge-small-files
>> configuration? It simply adds one additional stage that inspects the size of
>> each output file, recomputes the desired parallelism to reach a target size,
>> and runs a map-only coalesce before committing the final files. Since AFAIK
>> SparkSQL already stages the final output commit, it seems feasible to
>> respect this Hive config.
>>
>> https://community.hortonworks.com/questions/106987/hive-multiple-small-files.html
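
For reference, a minimal sketch of that inspect-then-coalesce pass using public
Spark and Hadoop APIs (the helper name, paths, and 256 MB target below are
assumptions for illustration, not Hive's actual implementation):

    import org.apache.hadoop.fs.Path
    import org.apache.spark.sql.SparkSession

    // Hypothetical post-write compaction: measure what was written, derive a
    // partition count yielding roughly targetBytes per file, and rewrite with
    // a map-only (no-shuffle) coalesce.
    def compactOutput(spark: SparkSession, dir: String, tmpDir: String,
                      targetBytes: Long = 256L << 20): Unit = {
      val out = new Path(dir)
      val fs = out.getFileSystem(spark.sparkContext.hadoopConfiguration)
      val totalBytes = fs.getContentSummary(out).getLength
      val numFiles = math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt
      // coalesce(n) adds no shuffle stage; each output task merges several
      // input partitions into one file
      spark.read.parquet(dir).coalesce(numFiles).write.parquet(tmpDir)
      fs.delete(out, true) // swap the compacted output into place
      fs.rename(new Path(tmpDir), out)
    }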
>>
>>
>> On Wed, Jul 25, 2018 at 1:55 PM Mark Hamstra <ma...@clearstorydata.com>
>> wrote:
>>
>>> See some of the related discussion under https://github.com/apache/spark/pull/21589
>>>
>>> It feels to me like we need some kind of user-code mechanism to signal
>>> policy preferences to Spark. This could also include ways to signal
>>> scheduling policy, which could include things like scheduling pool and/or
>>> barrier scheduling. Some of those scheduling policies operate at inherently
>>> different levels currently -- e.g. scheduling pools at the Job level
>>> (really, the thread-local level in the current implementation) and barrier
>>> scheduling at the Stage level -- so it is not completely obvious how to
>>> unify all of these policy options/preferences/mechanisms, or whether it is
>>> possible, but I think it is worth considering such things at a fairly high
>>> level of abstraction and trying to unify and simplify before making things
>>> more complex with multiple policy mechanisms.
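
For context, the thread-local pool assignment referred to above looks roughly
like this today (a minimal sketch, assuming a spark-shell session with df in
scope and a pool named "production" defined in fairscheduler.xml):

    // Pools are assigned per thread via a local property, so the policy
    // applies to every job submitted from this thread until it is cleared.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "production")
    try {
      df.write.parquet("/out/table") // jobs from this write run in "production"
    } finally {
      spark.sparkContext.setLocalProperty("spark.scheduler.pool", null)
    }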
>>>
>>> On Wed, Jul 25, 2018 at 1:37 PM Reynold Xin <rx...@databricks.com> wrote:
>>>
>>>> Seems like a good idea in general. Do other systems have similar
>>>> concepts? In general it'd be easier if we could follow an existing
>>>> convention, if there is one.
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 11:50 AM John Zhuge <jz...@apache.org> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Many Spark users in my company are asking for a way to control the
>>>>> number of output files in Spark SQL. There are use cases to either reduce
>>>>> or increase the number. The users prefer not to use the functions
>>>>> repartition(n) or coalesce(n, shuffle), which require them to write
>>>>> and deploy Scala/Java/Python code.
>>>>>
>>>>> Could we introduce a query hint for this purpose (similar to the
>>>>> Broadcast Join Hints)?
>>>>>
>>>>>     /*+ COALESCE(n, shuffle) */
>>>>>
>>>>> In general, is a query hint the best way to bring DataFrame functionality
>>>>> to SQL without extending SQL syntax? Any suggestion is highly appreciated.
>>>>>
>>>>> This requirement is not the same as SPARK-6221, which asked for
>>>>> auto-merging of output files.
>>>>>
>>>>> Thanks,
>>>>> John Zhuge
>>>>>
>>>>
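
For concreteness, the code-deployment alternative described above, next to the
proposed hint (a sketch: "events" and the paths are made-up names, and the hint
line uses the syntax proposed in this thread, which may differ from whatever
finally merges):

    // today, controlling the output file count means deploying code:
    df.coalesce(10).write.parquet("/out/a")     // narrow merge to 10 files, no shuffle
    df.repartition(200).write.parquet("/out/b") // full shuffle to 200 files

    // the proposal: the same control from pure SQL via a hint
    spark.sql("SELECT /*+ COALESCE(10, shuffle) */ * FROM events")
      .write.parquet("/out/c")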

Re: [DISCUSS][SQL] Control the number of output files

Posted by Koert Kuipers <ko...@tresata.com>.
we have found that to make shuffles reliable without OOMs we need to set
spark.sql.shuffle.partitions to a high number, at least 2000. yet this
leads to a large number of part files, which puts big pressure on spark
driver programs.

i tried to mitigate this with dataframe.coalesce to reduce the number of
files, but this is not acceptable. coalesce changes the tasks for the last
shuffle before it, bringing back the issues we tried to mitigate with a
high number for spark.sql.shuffle.partitions in the first place. doing a
dataframe.repartition before every write is also not an acceptable
approach: it is too high a price to pay just to bring down the number of
files.
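
To make the trade-off concrete, a minimal sketch (assumed session, table, and
column names):

    // run the aggregation shuffle wide to keep per-task memory low
    spark.conf.set("spark.sql.shuffle.partitions", "4000")
    val agg = df.groupBy("key").count()

    // coalesce(16) adds no stage: it is folded back into the aggregation,
    // which then effectively runs in only 16 tasks -- the OOM risk returns
    agg.coalesce(16).write.parquet("/out/a")

    // repartition(16) keeps the 4000-task aggregation but pays for a full
    // extra shuffle of the data just to cut the file count
    agg.repartition(16).write.parquet("/out/b")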

so i am very excited about any approach that efficiently merges files when
writing.



On Mon, Aug 6, 2018 at 5:28 PM, lukas nalezenec <lu...@apache.org> wrote:

> Hi Koert,
> There is no such Jira yet. We need SPARK-23889 first. You can find some
> mentions in the design document inside 23889.
> Best regards
> Lukas
>
> 2018-08-06 18:34 GMT+02:00 Koert Kuipers <ko...@tresata.com>:
>
>> i went through the jiras targeting 2.4.0 trying to find a feature where
>> spark would coalesce/repartition by size (so merge small files
>> automatically), but didn't find it.
>> can someone point me to it?
>> thank you.
>> best,
>> koert
>>
>> On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> [earlier quoted thread trimmed]

Re: [DISCUSS][SQL] Control the number of output files

Posted by lukas nalezenec <lu...@apache.org>.
Hi Koert,
There is no such Jira yet. We need SPARK-23889 first. You can find some
mentions in the design document inside 23889.
Best regards
Lukas

2018-08-06 18:34 GMT+02:00 Koert Kuipers <ko...@tresata.com>:

> i went through the jiras targeting 2.4.0 trying to find a feature where
> spark would coalesce/repartition by size (so merge small files
> automatically), but didn't find it.
> can someone point me to it?
> thank you.
> best,
> koert
>
> On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> [earlier quoted thread trimmed]

Re: [DISCUSS][SQL] Control the number of output files

Posted by Koert Kuipers <ko...@tresata.com>.
i went through the jiras targeting 2.4.0 trying to find a feature where
spark would coalesce/repartition by size (so merge small files
automatically), but didn't find it.
can someone point me to it?
thank you.
best,
koert

On Sun, Aug 5, 2018 at 9:06 PM, Koert Kuipers <ko...@tresata.com> wrote:

> [earlier quoted thread trimmed]

Re: [DISCUSS][SQL] Control the number of output files

Posted by John Zhuge <jz...@apache.org>.
https://issues.apache.org/jira/browse/SPARK-24940

The PR has been merged to 2.4.0.
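
With that merged, usage from SQL looks roughly like the following (a hedged
sketch; check SPARK-24940 for the exact final syntax, and "events" is a
made-up table name):

    // reduce the number of output files without redeploying code
    spark.sql("SELECT /*+ COALESCE(10) */ * FROM events").write.parquet("/out/small")

    // or force a full shuffle to a fixed partition count
    spark.sql("SELECT /*+ REPARTITION(200) */ * FROM events").write.parquet("/out/wide")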

On Sun, Aug 5, 2018 at 6:06 PM Koert Kuipers <ko...@tresata.com> wrote:

> [earlier quoted thread trimmed]

-- 
John Zhuge

Re: [DISCUSS][SQL] Control the number of output files

Posted by John Zhuge <jz...@apache.org>.
Great help from the community!

On Sun, Aug 5, 2018 at 6:17 PM Xiao Li <ga...@gmail.com> wrote:

> FYI, the new hints have been merged. They will be available in the
> upcoming release (Spark 2.4).
>
> *John Zhuge*, thanks for your work! Really appreciate it! Please submit
> more PRs and help the community improve Spark. : )
>
> Xiao
>
> 2018-08-05 21:06 GMT-04:00 Koert Kuipers <ko...@tresata.com>:
>
>> [earlier quoted thread trimmed]

-- 
John Zhuge

Re: [DISCUSS][SQL] Control the number of output files

Posted by Xiao Li <ga...@gmail.com>.
FYI, the new hints have been merged. They will be available in the upcoming
release (Spark 2.4).

*John Zhuge*, thanks for your work! Really appreciate it! Please submit
more PRs and help the community improve Spark. : )

Xiao

2018-08-05 21:06 GMT-04:00 Koert Kuipers <ko...@tresata.com>:

> [earlier quoted thread trimmed]