Posted to user@spark.apache.org by vinay Bajaj <vb...@gmail.com> on 2014/02/19 10:59:07 UTC

Spark process locality

Hi

It would be very helpful if anyone could elaborate on
spark.locality.wait and the various locality levels (process-local,
node-local, rack-local, and then any): what is the best configuration I
can achieve by tuning this wait, and what is the difference between
process-local and node-local?

Thanks
Vinay Bajaj

Re: Spark process locality

Posted by Mayur Rustagi <ma...@gmail.com>.
No, you cannot force an RDD onto a particular node.

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi




Re: Spark process locality

Posted by dachuan <hd...@gmail.com>.
Mayur, is there any way to pin each partition of an RDD to a particular node?

The input data is usually stored in HDFS and has its own preferred
locations, but I am curious whether we can force the RDD's partitions to
be placed in a chosen way, regardless of where the data is currently stored.

Thanks.




-- 
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210

Re: Spark process locality

Posted by Mayur Rustagi <ma...@gmail.com>.
You can find that in the Storage tab of the Spark web UI.
Compression will certainly help!
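To make that concrete, here is a minimal sketch (assuming a Spark 0.9-era Scala API; the master URL and HDFS path are placeholders). spark.rdd.compress only compresses serialized cached partitions, so the RDD is persisted with MEMORY_ONLY_SER, and SparkContext.getRDDStorageInfo reports the same per-RDD sizes as the Storage tab:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf()
  .setMaster("spark://master:7077")      // placeholder master URL
  .setAppName("LogStats")
  .set("spark.rdd.compress", "true")     // compress serialized cached partitions
val sc = new SparkContext(conf)

val logs = sc.textFile("hdfs:///logs/access.log")  // placeholder path
logs.persist(StorageLevel.MEMORY_ONLY_SER)         // serialized, so compression applies
logs.count()                                       // first action materializes the cache

// Same numbers as the web UI's Storage tab, programmatically:
for (info <- sc.getRDDStorageInfo)
  println(s"${info.name}: ${info.numCachedPartitions} cached partitions, " +
          s"${info.memSize} bytes in memory, ${info.diskSize} bytes on disk")
```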

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi




Re: Spark process locality

Posted by vinay Bajaj <vb...@gmail.com>.
Hi Mayur,

Thanks a lot for very quick reply.

I have a few questions regarding RDDs:
1) How do I know the RDD placement per machine, i.e. which RDD data is
cached at which location?
2) How do I know the total space taken by each RDD created by my
program/module?
3) Does enabling compression on RDDs help?

Thanks,
Vinay





Re: Spark process locality

Posted by Mayur Rustagi <ma...@gmail.com>.
It is highly likely that the locality level will not become a bottleneck,
since Spark tries to schedule tasks where the data is cached. Two things
might help:
1. Make sure you have enough memory to cache the whole dataset as an RDD;
keep in mind the RDD can take more space than the raw text, since Java
objects carry overhead.
2. Try increasing the replication factor of the data, so that it is
available on more workers and is therefore faster to cache on workers
that do not already have it (in the non-local cases).
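One way to get the second effect at the Spark level is a replicated storage level: a minimal sketch, assuming an existing SparkContext `sc` and a placeholder HDFS path. MEMORY_ONLY_2 keeps two in-memory copies of every cached partition, giving the scheduler more node-local slots for each task:

```scala
import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs:///logs/access.log")  // placeholder path
logs.persist(StorageLevel.MEMORY_ONLY_2)           // two replicas of each cached partition
logs.count()                                       // first action populates both replicas
```

For the input data itself, the HDFS replication factor can also be raised (e.g. with `hdfs dfs -setrep`), which increases the number of nodes that can read a block locally.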

Regards
Mayur

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi




Re: Spark process locality

Posted by vinay Bajaj <vb...@gmail.com>.
Hi Mayur

I am trying to analyse Apache logs that contain traffic details,
basically computing statistics on data points such as total views per
country and unique URLs. I have one cluster running with 4 workers and
one master (240 GB total space and 96 cores). I was trying a few things
to make it faster, which is how I got stuck on the locality level of the
tasks.

Regards
Vinay Bajaj



Re: Spark process locality

Posted by Patrick Wendell <pw...@gmail.com>.
I think these are fairly well explained in the user docs. Was there
something unclear that maybe we could update?

http://spark.incubator.apache.org/docs/latest/configuration.html


Re: Spark process locality

Posted by Mayur Rustagi <ma...@gmail.com>.
Process-local means the data is cached in the same JVM as the task;
node-local means it is cached on the same machine but not in the same
JVM (in another executor process, perhaps). Tuning the wait depends on
your system configuration (memory vs. disk vs. network). I frankly never
had to modify it. Can you share the use case that requires you to change
it?
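For reference, a minimal sketch of what tuning the wait looks like (Spark 0.9-era Scala API; the millisecond values here are the documented defaults, shown as placeholders rather than recommendations). spark.locality.wait is how long the scheduler waits for a slot at one locality level before falling back to the next (process-local, then node-local, then rack-local, then any), and each level can be overridden separately:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("LocalityTuning")
  .set("spark.locality.wait", "3000")          // default per-level wait, in ms
  .set("spark.locality.wait.process", "3000")  // wait for a PROCESS_LOCAL slot
  .set("spark.locality.wait.node", "3000")     // wait for a NODE_LOCAL slot
  .set("spark.locality.wait.rack", "3000")     // wait for a RACK_LOCAL slot
val sc = new SparkContext(conf)
```

Raising the wait trades scheduling latency for better locality; setting it to 0 makes the scheduler take any free slot immediately.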

Mayur Rustagi
Ph: +919632149971
http://www.sigmoidanalytics.com
https://twitter.com/mayur_rustagi


