Posted to user@spark.apache.org by Anis Nasir <aa...@gmail.com> on 2017/02/14 08:01:53 UTC

Handling Skewness and Heterogeneity

Dear All,

I have a few use cases for Spark Streaming where the Spark cluster consists of
heterogeneous machines.

Additionally, there is skew in both the input distribution (e.g., each
tuple's key is drawn from a Zipf distribution) and the service time (e.g.,
the time required to process each tuple also follows a Zipf distribution).
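
For concreteness, here is a small plain-Python sketch (not Spark code; key
count, sample size, and exponent are illustrative) of the kind of key skew a
Zipf distribution produces:

```python
import random
from collections import Counter

def zipf_sample(n_keys, n_samples, s=1.0, seed=42):
    """Draw keys 1..n_keys with probability proportional to 1/rank**s."""
    rng = random.Random(seed)
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    return rng.choices(range(1, n_keys + 1), weights=weights, k=n_samples)

keys = zipf_sample(n_keys=100, n_samples=10_000)
counts = Counter(keys)
# Under a Zipf law the hottest key dominates; a hash partitioner sends all
# of its tuples to one task, so that task becomes the straggler.
hottest, hottest_count = counts.most_common(1)[0]
print(hottest, hottest_count, counts.most_common(5))
```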

I want to know how Spark will handle such use cases.

Any help will be highly appreciated!


Regards,
Anis

Fwd: Handling Skewness and Heterogeneity

Posted by Anis Nasir <aa...@gmail.com>.
Dear all,

Could you please comment on the use case below?

Thank you in advance.

Regards,
Anis


---------- Forwarded message ---------
From: Anis Nasir <aa...@gmail.com>
Date: Tue, 14 Feb 2017 at 17:01
Subject: Handling Skewness and Heterogeneity
To: <us...@spark.apache.org>



Re: Handling Skewness and Heterogeneity

Posted by Galen Marchetti <ga...@gmail.com>.
Anis,

If your partitions are smaller than your smallest machine, and you request
executors for your Spark jobs no larger than your smallest machine, then
Spark (via the cluster manager) will automatically assign more executors to
your larger machines.

As long as you request small executors, you will utilize your large boxes
effectively, because they will run many more executors simultaneously than
the small boxes do.
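
As a hypothetical sizing sketch (the node sizes and job file name are made
up), requesting executors no larger than the smallest node might look like:

```shell
# Suppose the smallest node has 16 GB / 4 cores: request executors that fit
# within it, and let the cluster manager pack more of them onto larger nodes.
spark-submit \
  --master yarn \
  --executor-memory 12g \
  --executor-cores 3 \
  --conf spark.dynamicAllocation.enabled=true \
  my_streaming_job.py
```

With dynamic allocation enabled, Spark also scales the executor count up and
down with load, which helps when per-tuple service times vary.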


Re: Handling Skewness and Heterogeneity

Posted by Anis Nasir <aa...@gmail.com>.
Thank you very much for your reply.

I guess this approach balances the load across the cluster of machines.

However, I am looking for something that works for a heterogeneous cluster
where the distribution is not known a priori.

Cheers,
Anis



Re: Handling Skewness and Heterogeneity

Posted by Galen Marchetti <ga...@gmail.com>.
Anis,

I've typically seen people handle skew by "salting" the high-volume keys
with random values, partitioning the dataset on the original key *and* the
random value, and then reducing.

Ex: ( <digits_in_salary>, <name> ) -> ( <digits_in_salary>, <random_digit>,
<name> )

This transformation shrinks the huge partition, making it tractable for
Spark, as long as you can figure out the logic for aggregating the results
of the salted partitions back together.
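
A minimal plain-Python sketch of this two-stage salted aggregation (all
names, the salt count, and the toy data are illustrative; in a real Spark
job the two stages would be reduceByKey calls over composite keys):

```python
import random
from collections import defaultdict

def salted_sum(records, hot_keys, n_salts=4, seed=0):
    """Two-stage aggregation: salt hot keys, partially reduce on
    (key, salt), then merge the partials back per original key."""
    rng = random.Random(seed)

    # Stage 1: reduce on the composite (key, salt) key. Tuples of a hot
    # key are spread across n_salts partial groups instead of one.
    partial = defaultdict(int)
    for key, value in records:
        salt = rng.randrange(n_salts) if key in hot_keys else 0
        partial[(key, salt)] += value

    # Stage 2: merge the salted partials back to the original key.
    total = defaultdict(int)
    for (key, _salt), value in partial.items():
        total[key] += value
    return dict(total)

records = [("a", 1)] * 1000 + [("b", 1)] * 10   # "a" is the hot key
print(salted_sum(records, hot_keys={"a"}))      # {'a': 1000, 'b': 10}
```

The merge in stage 2 is what the note above refers to: it only works because
sums (and other associative, commutative reductions) can be combined from
partial results.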
