Posted to dev@spark.apache.org by Nezih Yigitbasi <ny...@netflix.com.INVALID> on 2016/04/01 19:03:24 UTC

Re: how about a custom coalesce() policy?

Hey Reynold,
Created an issue (and a PR) for this change to get discussions started.

Thanks,
Nezih

On Fri, Feb 26, 2016 at 12:03 AM Reynold Xin <rx...@databricks.com> wrote:

> Using the right email for Nezih
>
>
> On Fri, Feb 26, 2016 at 12:01 AM, Reynold Xin <rx...@databricks.com> wrote:
>
>> I think this can be useful.
>>
>> The only thing is that we are slowly migrating to the Dataset/DataFrame
>> API, leaving RDD mostly as-is as a lower-level API. Maybe we should do
>> both? In either case it would be great to discuss the API on a pull
>> request. Cheers.
>>
>> On Wed, Feb 24, 2016 at 2:08 PM, Nezih Yigitbasi <
>> nyigitbasi@netflix.com.invalid> wrote:
>>
>>> Hi Spark devs,
>>>
>>> Some time ago I sent an email about my problem: I want to merge a large
>>> number of small files with Spark. Currently I am using Hive with the
>>> CombineHiveInputFormat, where I can control the size of the output files
>>> with the max split size parameter (which the CombineHiveInputFormat uses
>>> to coalesce the input splits). My first attempt was to use coalesce(),
>>> but since coalesce() only considers the target number of partitions, the
>>> output file sizes varied wildly.
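>>>
>>> As a minimal sketch of that first attempt (paths here are hypothetical),
>>> plain coalesce() gives no handle on per-partition size:
>>>
>>>   // Cut the partition count to merge small files; the resulting output
>>>   // sizes depend entirely on how the input partitions happen to pack.
>>>   val merged = sc.textFile("hdfs:///warehouse/small-files/*")
>>>     .coalesce(16)  // a target count only, no size-based control
>>>   merged.saveAsTextFile("hdfs:///warehouse/merged")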
>>>
>>> What I think could be useful is an optional PartitionCoalescer parameter
>>> (a new interface) in the coalesce() method (or maybe a new method?) that
>>> callers can implement for custom coalescing strategies. For my use case
>>> I have already implemented a SizeBasedPartitionCoalescer that coalesces
>>> partitions by checking their sizes against a max split size parameter,
>>> similar to the CombineHiveInputFormat (I also had to expose HadoopRDD to
>>> get access to the individual split sizes, etc.).
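>>>
>>> A rough sketch of what the interface and a size-based implementation
>>> might look like (names and signatures here are illustrative rather than
>>> the actual PR; partition sizes come from a hypothetical callback):
>>>
>>>   import org.apache.spark.rdd.RDD
>>>   import scala.collection.mutable.ArrayBuffer
>>>
>>>   // One group of parent partitions to merge into a single output
>>>   // partition.
>>>   case class PartitionGroup(partitionIndices: Seq[Int])
>>>
>>>   // The pluggable strategy: given the parent RDD, decide how its
>>>   // partitions are grouped.
>>>   trait PartitionCoalescer {
>>>     def coalesce(maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup]
>>>   }
>>>
>>>   // Greedy size-based packing in the spirit of CombineHiveInputFormat:
>>>   // fill a group until adding the next partition would exceed the max
>>>   // split size, then start a new group. partitionSize is a hypothetical
>>>   // callback; real split sizes would come from the underlying HadoopRDD.
>>>   class SizeBasedPartitionCoalescer(
>>>       maxSplitSizeBytes: Long,
>>>       partitionSize: Int => Long) extends PartitionCoalescer {
>>>
>>>     override def coalesce(
>>>         maxPartitions: Int, parent: RDD[_]): Array[PartitionGroup] = {
>>>       val groups = ArrayBuffer.empty[PartitionGroup]
>>>       var current = ArrayBuffer.empty[Int]
>>>       var currentBytes = 0L
>>>       for (idx <- parent.partitions.indices) {
>>>         val bytes = partitionSize(idx)
>>>         if (current.nonEmpty && currentBytes + bytes > maxSplitSizeBytes) {
>>>           groups += PartitionGroup(current.toSeq)
>>>           current = ArrayBuffer.empty[Int]
>>>           currentBytes = 0L
>>>         }
>>>         current += idx
>>>         currentBytes += bytes
>>>       }
>>>       if (current.nonEmpty) groups += PartitionGroup(current.toSeq)
>>>       groups.toArray
>>>     }
>>>   }
>>>
>>>   // Hypothetical call site (coalesce() has no such parameter today):
>>>   // rdd.coalesce(16, shuffle = false,
>>>   //   partitionCoalescer = Some(new SizeBasedPartitionCoalescer(128L << 20, sizeOf)))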
>>>
>>> What do you think about such a change? Could it be useful to other users
>>> as well? Or is there an easier way to accomplish the same merge logic?
>>> If you think it may be useful, I already have an implementation and will
>>> be happy to work with the community to contribute it.
>>>
>>> Thanks,
>>> Nezih

Re: how about a custom coalesce() policy?

Posted by Nezih Yigitbasi <ny...@netflix.com.INVALID>.
Sure, here <https://issues.apache.org/jira/browse/SPARK-14042> is the JIRA
and this <https://github.com/apache/spark/pull/11865> is the PR.

Nezih

Re: how about a custom coalesce() policy?

Posted by Hemant Bhanawat <he...@gmail.com>.
correcting email id for Nezih

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io

Re: how about a custom coalesce() policy?

Posted by Hemant Bhanawat <he...@gmail.com>.
Hi Nezih,

Can you share JIRA and PR numbers?

> This partial decoupling of data partitioning strategy and Spark
> parallelism would be a useful feature for any data store.

Hemant

Hemant Bhanawat <https://www.linkedin.com/in/hemant-bhanawat-92a3811>
www.snappydata.io