You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spark.apache.org by Io...@nomura.com on 2016/04/28 11:33:03 UTC

RDD.broadcast

Hi,

It is a common pattern to process an RDD, collect (typically a subset) to the driver and then broadcast back.

Adding an RDD method that can do that using the torrent broadcast mechanics would be much more efficient. In addition, it would not require the Driver to also utilize its Heap holding this broadcast.

I guess this can become complicated if the resulting broadcast is required to keep lineage information, but assuming a torrent distribution, once the broadcast is synced then lineage would not be required. I’d also expect the call to rdd.brodcast to be an action that eagerly distributes the broadcast and returns when the operation has succeeded.

Is this something that could be implemented or are there any reasons that prohibits this?

Thanks
Ioannis


This e-mail (including any attachments) is private and confidential, may contain proprietary or privileged information and is intended for the named recipient(s) only. Unintended recipients are strictly prohibited from taking action on the basis of information in this e-mail and must contact the sender immediately, delete this e-mail (and all attachments) and destroy any hard copies. Nomura will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail. If verification is sought please request a hard copy. Any reference to the terms of executed transactions should be treated as preliminary only and subject to formal written confirmation by Nomura. Nomura reserves the right to retain, monitor and intercept e-mail communications through its networks (subject to and in accordance with applicable laws). No confidentiality or privilege is waived or lost by Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is a reference to any entity in the Nomura Holdings, Inc. group. Please read our Electronic Communications Legal Notice which forms part of this e-mail: http://www.Nomura.com/email_disclaimer.htm


Re: RDD.broadcast

Posted by Mike Hynes <91...@gmail.com>.
I second knowing the use case for interest. I can imagine a case where
knowledge of the RDD key distribution would help local computations, for
relaticely few keys, but would be interested to hear your motive.

Essentially, are you trying to achieve what would be an all-reduce type
operation in MPI?
On Apr 28, 2016 9:08 PM, "Marcin Tustin" <mt...@handybook.com> wrote:

> Why would you ever need to do this? I'm genuinely curious. I view
> collects as being solely for interactive work.
>
> On Thursday, April 28, 2016, <Io...@nomura.com> wrote:
>
>> Hi,
>>
>>
>>
>> It is a common pattern to process an RDD, collect (typically a subset) to
>> the driver and then broadcast back.
>>
>>
>>
>> Adding an RDD method that can do that using the torrent broadcast
>> mechanics would be much more efficient. In addition, it would not require
>> the Driver to also utilize its Heap holding this broadcast.
>>
>>
>>
>> I guess this can become complicated if the resulting broadcast is
>> required to keep lineage information, but assuming a torrent distribution,
>> once the broadcast is synced then lineage would not be required. I’d also
>> expect the call to rdd.brodcast to be an action that eagerly distributes
>> the broadcast and returns when the operation has succeeded.
>>
>>
>>
>> Is this something that could be implemented or are there any reasons that
>> prohibits this?
>>
>>
>>
>> Thanks
>>
>> Ioannis
>>
>> This e-mail (including any attachments) is private and confidential, may
>> contain proprietary or privileged information and is intended for the named
>> recipient(s) only. Unintended recipients are strictly prohibited from
>> taking action on the basis of information in this e-mail and must contact
>> the sender immediately, delete this e-mail (and all attachments) and
>> destroy any hard copies. Nomura will not accept responsibility or liability
>> for the accuracy or completeness of, or the presence of any virus or
>> disabling code in, this e-mail. If verification is sought please request a
>> hard copy. Any reference to the terms of executed transactions should be
>> treated as preliminary only and subject to formal written confirmation by
>> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
>> communications through its networks (subject to and in accordance with
>> applicable laws). No confidentiality or privilege is waived or lost by
>> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
>> a reference to any entity in the Nomura Holdings, Inc. group. Please read
>> our Electronic Communications Legal Notice which forms part of this e-mail:
>> http://www.Nomura.com/email_disclaimer.htm
>>
>
> Want to work at Handy? Check out our culture deck and open roles
> <http://www.handy.com/careers>
> Latest news <http://www.handy.com/press> at Handy
> Handy just raised $50m
> <http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led
> by Fidelity
>
>

Re: RDD.broadcast

Posted by Reynold Xin <rx...@databricks.com>.
This is a nice feature in broadcast join. It is just a little bit
complicated to do and as a result hasn't been prioritized as highly yet.


On Thu, Apr 28, 2016 at 5:51 AM, <Io...@nomura.com> wrote:

> I was aiming to show the operations with pseudo-code, but I apparently
> failed, so Java it is J
>
> Assume the following 3 datasets on HDFS.
>
> 1.       RDD1: User (1 Million rows – 2GB ) Columns: uid, locationId,
> (extra stuff)
>
> 2.       RDD2: Actions (1 Billion rows – 500GB) Columns: uid_1, uid_2
> (extra stuff)
>
> 3.       RDD3: Locations (10 Millions rows – 10GB) Columns: locationId,
> (extra stuff)
>
>
>
> So, what I am doing is shown below.
>
> I assume that it is clear of why using a “join” would be really
> inefficient, but if not have a look of some of the reasons here
> <http://stackoverflow.com/questions/30412325/hoes-does-spark-schedule-a-join/30490619#30490619>
> .
>
>
>
> //1 Filter actions
>
>  > JavaRDD actionRdd = RDD2.filter(f->….); //300GB
>
>
>
> //2 Filter users & Broadcast
>
> > JavaRDD userRdd =  RDD1.filter(f->…); //1GB
>
> > Map m = userRdd.keyBy(f->g.getUid()).collectAsMap();
>
> > Broadcast<Map> userMap = sparkCtx.broadcast(m);//1GB
>
>
>
> //3 Filter locations & Broadcast
>
> > JavaPairRDD locations = RDD3.filter(f->….).keyBy(f->g.getLocationId());
>
> > Broadcast<Map> locationMap =
> sparkCtx.broadcast(locations.collectAsMap());//1GB
>
>
>
> //4 Process using data from all 3 data sets
>
> > actionRdd.map(f-> {
>
> User u1 = userMap.value.get( f.getUid1());
>
> User u2 = userMap.value.get( f.getUid2());
>
> Location l= locationMap.value.get(u.getLocationId();
>
> Object result = method(f,u1,u2,l);//method implementation not important,
> but requires all 3 objects
>
> return result;
>
>           });
>
>
>
>
>
> *From:* Marcin Tustin [mailto:mtustin@handybook.com]
> *Sent:* 28 April 2016 12:27
>
> *To:* Deligiannis, Ioannis (UK)
> *Cc:* dev@spark.apache.org
> *Subject:* Re: RDD.broadcast
>
>
>
> I don't know what your notation really means. I'm very much unclear on why
> you can't use the filter method for 1. If you're talking about
> splitting/bucketing rather filtering as such I think that is a specific
> lacuna in spark's Api.
>
>
>
> I've generally found the join api to be entirely adequate for my needs, so
> I don't really have a comment on 2.
>
>
> On Thursday, April 28, 2016, <Io...@nomura.com> wrote:
>
> One example pattern we have it doing joins or filters based on two
> datasets. E.g.
>
> 1         Filter –multiple- RddB for a given set extracted from RddA
> (keyword here is multiple times)
>
> a.       RddA -> keyBy -> distinct -> collect() to Set A;
>
> b.      RddB -> Filter using Set A;
>
> 2         “Join” using composition on executor (again multiple times)
>
> a.       RddA -> filter by XYZ -> keyBy join attribute -> collectAsMap
> ->Broadcast MapA;
>
> b.      RddB -> map (Broadcast<Map<K,V>> MapA;
>
>
>
> The first use case might not be that common, but joining a large RDD with
> a small (reference) RDD is quite common and much faster than using “join”
> method.
>
>
>
>
>
> *From:* Marcin Tustin [mailto:mtustin@handybook.com]
> *Sent:* 28 April 2016 12:08
> *To:* Deligiannis, Ioannis (UK)
> *Cc:* dev@spark.apache.org
> *Subject:* Re: RDD.broadcast
>
>
>
> Why would you ever need to do this? I'm genuinely curious. I view collects
> as being solely for interactive work.
>
> On Thursday, April 28, 2016, <Io...@nomura.com> wrote:
>
> Hi,
>
>
>
> It is a common pattern to process an RDD, collect (typically a subset) to
> the driver and then broadcast back.
>
>
>
> Adding an RDD method that can do that using the torrent broadcast
> mechanics would be much more efficient. In addition, it would not require
> the Driver to also utilize its Heap holding this broadcast.
>
>
>
> I guess this can become complicated if the resulting broadcast is required
> to keep lineage information, but assuming a torrent distribution, once the
> broadcast is synced then lineage would not be required. I’d also expect the
> call to rdd.brodcast to be an action that eagerly distributes the broadcast
> and returns when the operation has succeeded.
>
>
>
> Is this something that could be implemented or are there any reasons that
> prohibits this?
>
>
>
> Thanks
>
> Ioannis
>
>
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.Nomura.com_email-5Fdisclaimer.htm&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=SLnOgTBJ2zAlhtvjcFRXfqUArds-4HSAZCgFXLgMCVY&e=>
>
>
>
> Want to work at Handy? Check out our culture deck and open roles
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_careers&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=WgDnCrSGv_qt66f2cabjugmMGU46gc5rSkt_gm7lEkQ&e=>
>
> Latest news
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_press&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=rfQxr8cDwVFK7Mql1_HdnvqAmXeiOHZgnjNtKXGn_Kg&e=> at
> Handy
>
> Handy just raised $50m
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__venturebeat.com_2015_11_02_on-2Ddemand-2Dhome-2Dservice-2Dhandy-2Draises-2D50m-2Din-2Dround-2Dled-2Dby-2Dfidelity_&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=RbQTDcalISb9w2WMxzXmRgR1mr7QiCaqpD2bLAkt-z4&e=> led
> by Fidelity
>
>
>
> [image: Image removed by sender.]
>
>
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.Nomura.com_email-5Fdisclaimer.htm&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=3sSvB-R66BWR0E56L12fHFtbHcZLeqp2hdnrMncx1-U&s=zRDuW7ua96biT1hpVLmf9v2kyB0mV0SzsqtLLTbsRl0&e=>
>
>
>
> Want to work at Handy? Check out our culture deck and open roles
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_careers&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=3sSvB-R66BWR0E56L12fHFtbHcZLeqp2hdnrMncx1-U&s=x2weVTVWoKPax6nWFBzezd9rEyFBLCr1tgL0NYQ8CkY&e=>
>
> Latest news
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_press&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=3sSvB-R66BWR0E56L12fHFtbHcZLeqp2hdnrMncx1-U&s=EX4TthvT0szzJSv8UOAPWUBq0vQMseqWuXColEdW9jw&e=> at
> Handy
>
> Handy just raised $50m
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__venturebeat.com_2015_11_02_on-2Ddemand-2Dhome-2Dservice-2Dhandy-2Draises-2D50m-2Din-2Dround-2Dled-2Dby-2Dfidelity_&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=3sSvB-R66BWR0E56L12fHFtbHcZLeqp2hdnrMncx1-U&s=vIRLq79Lr1dBdabX-d9eAi1uJsbpRVx50slHOT11RoQ&e=> led
> by Fidelity
>
>
>
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
>

RE: RDD.broadcast

Posted by Io...@nomura.com.
I was aiming to show the operations with pseudo-code, but I apparently failed, so Java it is ☺
Assume the following 3 datasets on HDFS.

1.       RDD1: User (1 Million rows – 2GB ) Columns: uid, locationId, (extra stuff)

2.       RDD2: Actions (1 Billion rows – 500GB) Columns: uid_1, uid_2 (extra stuff)

3.       RDD3: Locations (10 Millions rows – 10GB) Columns: locationId, (extra stuff)

So, what I am doing is shown below.
I assume that it is clear of why using a “join” would be really inefficient, but if not have a look of some of the reasons here<http://stackoverflow.com/questions/30412325/hoes-does-spark-schedule-a-join/30490619#30490619> .

//1 Filter actions
 > JavaRDD actionRdd = RDD2.filter(f->….); //300GB

//2 Filter users & Broadcast
> JavaRDD userRdd =  RDD1.filter(f->…); //1GB
> Map m = userRdd.keyBy(f->g.getUid()).collectAsMap();
> Broadcast<Map> userMap = sparkCtx.broadcast(m);//1GB

//3 Filter locations & Broadcast
> JavaPairRDD locations = RDD3.filter(f->….).keyBy(f->g.getLocationId());
> Broadcast<Map> locationMap = sparkCtx.broadcast(locations.collectAsMap());//1GB

//4 Process using data from all 3 data sets
> actionRdd.map(f-> {
User u1 = userMap.value.get( f.getUid1());
User u2 = userMap.value.get( f.getUid2());
Location l= locationMap.value.get(u.getLocationId();
Object result = method(f,u1,u2,l);//method implementation not important, but requires all 3 objects
return result;
          });


From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 28 April 2016 12:27
To: Deligiannis, Ioannis (UK)
Cc: dev@spark.apache.org
Subject: Re: RDD.broadcast

I don't know what your notation really means. I'm very much unclear on why you can't use the filter method for 1. If you're talking about splitting/bucketing rather filtering as such I think that is a specific lacuna in spark's Api.

I've generally found the join api to be entirely adequate for my needs, so I don't really have a comment on 2.

On Thursday, April 28, 2016, <Io...@nomura.com>> wrote:
One example pattern we have it doing joins or filters based on two datasets. E.g.

1         Filter –multiple- RddB for a given set extracted from RddA (keyword here is multiple times)

a.       RddA -> keyBy -> distinct -> collect() to Set A;

b.      RddB -> Filter using Set A;

2         “Join” using composition on executor (again multiple times)

a.       RddA -> filter by XYZ -> keyBy join attribute -> collectAsMap ->Broadcast MapA;

b.      RddB -> map (Broadcast<Map<K,V>> MapA;


The first use case might not be that common, but joining a large RDD with a small (reference) RDD is quite common and much faster than using “join” method.


From: Marcin Tustin [mailto:mtustin@handybook.com<javascript:_e(%7B%7D,'cvml','mtustin@handybook.com');>]
Sent: 28 April 2016 12:08
To: Deligiannis, Ioannis (UK)
Cc: dev@spark.apache.org<javascript:_e(%7B%7D,'cvml','dev@spark.apache.org');>
Subject: Re: RDD.broadcast

Why would you ever need to do this? I'm genuinely curious. I view collects as being solely for interactive work.

On Thursday, April 28, 2016, <Ioannis.Deligiannis@nomura.com<javascript:_e(%7B%7D,'cvml','Ioannis.Deligiannis@nomura.com');>> wrote:
Hi,

It is a common pattern to process an RDD, collect (typically a subset) to the driver and then broadcast back.

Adding an RDD method that can do that using the torrent broadcast mechanics would be much more efficient. In addition, it would not require the Driver to also utilize its Heap holding this broadcast.

I guess this can become complicated if the resulting broadcast is required to keep lineage information, but assuming a torrent distribution, once the broadcast is synced then lineage would not be required. I’d also expect the call to rdd.brodcast to be an action that eagerly distributes the broadcast and returns when the operation has succeeded.

Is this something that could be implemented or are there any reasons that prohibits this?

Thanks
Ioannis

This e-mail (including any attachments) is private and confidential, may contain proprietary or privileged information and is intended for the named recipient(s) only. Unintended recipients are strictly prohibited from taking action on the basis of information in this e-mail and must contact the sender immediately, delete this e-mail (and all attachments) and destroy any hard copies. Nomura will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail. If verification is sought please request a hard copy. Any reference to the terms of executed transactions should be treated as preliminary only and subject to formal written confirmation by Nomura. Nomura reserves the right to retain, monitor and intercept e-mail communications through its networks (subject to and in accordance with applicable laws). No confidentiality or privilege is waived or lost by Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is a reference to any entity in the Nomura Holdings, Inc. group. Please read our Electronic Communications Legal Notice which forms part of this e-mail: http://www.Nomura.com/email_disclaimer.htm<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.Nomura.com_email-5Fdisclaimer.htm&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=SLnOgTBJ2zAlhtvjcFRXfqUArds-4HSAZCgFXLgMCVY&e=>

Want to work at Handy? Check out our culture deck and open roles<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_careers&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=WgDnCrSGv_qt66f2cabjugmMGU46gc5rSkt_gm7lEkQ&e=>
Latest news<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_press&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=rfQxr8cDwVFK7Mql1_HdnvqAmXeiOHZgnjNtKXGn_Kg&e=> at Handy
Handy just raised $50m<https://urldefense.proofpoint.com/v2/url?u=http-3A__venturebeat.com_2015_11_02_on-2Ddemand-2Dhome-2Dservice-2Dhandy-2Draises-2D50m-2Din-2Dround-2Dled-2Dby-2Dfidelity_&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=RbQTDcalISb9w2WMxzXmRgR1mr7QiCaqpD2bLAkt-z4&e=> led by Fidelity

[Image removed by sender.]

This e-mail (including any attachments) is private and confidential, may contain proprietary or privileged information and is intended for the named recipient(s) only. Unintended recipients are strictly prohibited from taking action on the basis of information in this e-mail and must contact the sender immediately, delete this e-mail (and all attachments) and destroy any hard copies. Nomura will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail. If verification is sought please request a hard copy. Any reference to the terms of executed transactions should be treated as preliminary only and subject to formal written confirmation by Nomura. Nomura reserves the right to retain, monitor and intercept e-mail communications through its networks (subject to and in accordance with applicable laws). No confidentiality or privilege is waived or lost by Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is a reference to any entity in the Nomura Holdings, Inc. group. Please read our Electronic Communications Legal Notice which forms part of this e-mail: http://www.Nomura.com/email_disclaimer.htm<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.Nomura.com_email-5Fdisclaimer.htm&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=3sSvB-R66BWR0E56L12fHFtbHcZLeqp2hdnrMncx1-U&s=zRDuW7ua96biT1hpVLmf9v2kyB0mV0SzsqtLLTbsRl0&e=>

Want to work at Handy? Check out our culture deck and open roles<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_careers&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=3sSvB-R66BWR0E56L12fHFtbHcZLeqp2hdnrMncx1-U&s=x2weVTVWoKPax6nWFBzezd9rEyFBLCr1tgL0NYQ8CkY&e=>
Latest news<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_press&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=3sSvB-R66BWR0E56L12fHFtbHcZLeqp2hdnrMncx1-U&s=EX4TthvT0szzJSv8UOAPWUBq0vQMseqWuXColEdW9jw&e=> at Handy
Handy just raised $50m<https://urldefense.proofpoint.com/v2/url?u=http-3A__venturebeat.com_2015_11_02_on-2Ddemand-2Dhome-2Dservice-2Dhandy-2Draises-2D50m-2Din-2Dround-2Dled-2Dby-2Dfidelity_&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=3sSvB-R66BWR0E56L12fHFtbHcZLeqp2hdnrMncx1-U&s=vIRLq79Lr1dBdabX-d9eAi1uJsbpRVx50slHOT11RoQ&e=> led by Fidelity

[http://marketing-email-assets.handybook.com/smalllogo.png]


This e-mail (including any attachments) is private and confidential, may contain proprietary or privileged information and is intended for the named recipient(s) only. Unintended recipients are strictly prohibited from taking action on the basis of information in this e-mail and must contact the sender immediately, delete this e-mail (and all attachments) and destroy any hard copies. Nomura will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail. If verification is sought please request a hard copy. Any reference to the terms of executed transactions should be treated as preliminary only and subject to formal written confirmation by Nomura. Nomura reserves the right to retain, monitor and intercept e-mail communications through its networks (subject to and in accordance with applicable laws). No confidentiality or privilege is waived or lost by Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is a reference to any entity in the Nomura Holdings, Inc. group. Please read our Electronic Communications Legal Notice which forms part of this e-mail: http://www.Nomura.com/email_disclaimer.htm


Re: RDD.broadcast

Posted by Marcin Tustin <mt...@handybook.com>.
I don't know what your notation really means. I'm very much unclear on why
you can't use the filter method for 1. If you're talking about
splitting/bucketing rather filtering as such I think that is a specific
lacuna in spark's Api.

I've generally found the join api to be entirely adequate for my needs, so
I don't really have a comment on 2.

On Thursday, April 28, 2016, <Io...@nomura.com> wrote:

> One example pattern we have it doing joins or filters based on two
> datasets. E.g.
>
> 1         Filter –multiple- RddB for a given set extracted from RddA
> (keyword here is multiple times)
>
> a.       RddA -> keyBy -> distinct -> collect() to Set A;
>
> b.      RddB -> Filter using Set A;
>
> 2         “Join” using composition on executor (again multiple times)
>
> a.       RddA -> filter by XYZ -> keyBy join attribute -> collectAsMap
> ->Broadcast MapA;
>
> b.      RddB -> map (Broadcast<Map<K,V>> MapA;
>
>
>
> The first use case might not be that common, but joining a large RDD with
> a small (reference) RDD is quite common and much faster than using “join”
> method.
>
>
>
>
>
> *From:* Marcin Tustin [mailto:mtustin@handybook.com
> <javascript:_e(%7B%7D,'cvml','mtustin@handybook.com');>]
> *Sent:* 28 April 2016 12:08
> *To:* Deligiannis, Ioannis (UK)
> *Cc:* dev@spark.apache.org
> <javascript:_e(%7B%7D,'cvml','dev@spark.apache.org');>
> *Subject:* Re: RDD.broadcast
>
>
>
> Why would you ever need to do this? I'm genuinely curious. I view collects
> as being solely for interactive work.
>
> On Thursday, April 28, 2016, <Ioannis.Deligiannis@nomura.com
> <javascript:_e(%7B%7D,'cvml','Ioannis.Deligiannis@nomura.com');>> wrote:
>
> Hi,
>
>
>
> It is a common pattern to process an RDD, collect (typically a subset) to
> the driver and then broadcast back.
>
>
>
> Adding an RDD method that can do that using the torrent broadcast
> mechanics would be much more efficient. In addition, it would not require
> the Driver to also utilize its Heap holding this broadcast.
>
>
>
> I guess this can become complicated if the resulting broadcast is required
> to keep lineage information, but assuming a torrent distribution, once the
> broadcast is synced then lineage would not be required. I’d also expect the
> call to rdd.brodcast to be an action that eagerly distributes the broadcast
> and returns when the operation has succeeded.
>
>
>
> Is this something that could be implemented or are there any reasons that
> prohibits this?
>
>
>
> Thanks
>
> Ioannis
>
>
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.Nomura.com_email-5Fdisclaimer.htm&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=SLnOgTBJ2zAlhtvjcFRXfqUArds-4HSAZCgFXLgMCVY&e=>
>
>
>
> Want to work at Handy? Check out our culture deck and open roles
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_careers&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=WgDnCrSGv_qt66f2cabjugmMGU46gc5rSkt_gm7lEkQ&e=>
>
> Latest news
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_press&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=rfQxr8cDwVFK7Mql1_HdnvqAmXeiOHZgnjNtKXGn_Kg&e=> at
> Handy
>
> Handy just raised $50m
> <https://urldefense.proofpoint.com/v2/url?u=http-3A__venturebeat.com_2015_11_02_on-2Ddemand-2Dhome-2Dservice-2Dhandy-2Draises-2D50m-2Din-2Dround-2Dled-2Dby-2Dfidelity_&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=RbQTDcalISb9w2WMxzXmRgR1mr7QiCaqpD2bLAkt-z4&e=> led
> by Fidelity
>
>
>
> [image: Image removed by sender.]
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
>

-- 
Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers>
Latest news <http://www.handy.com/press> at Handy
Handy just raised $50m 
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led 
by Fidelity


RE: RDD.broadcast

Posted by Io...@nomura.com.
One example pattern we have it doing joins or filters based on two datasets. E.g.

1         Filter –multiple- RddB for a given set extracted from RddA (keyword here is multiple times)

a.       RddA -> keyBy -> distinct -> collect() to Set A;

b.      RddB -> Filter using Set A;

2         “Join” using composition on executor (again multiple times)

a.       RddA -> filter by XYZ -> keyBy join attribute -> collectAsMap ->Broadcast MapA;

b.      RddB -> map (Broadcast<Map<K,V>> MapA;


The first use case might not be that common, but joining a large RDD with a small (reference) RDD is quite common and much faster than using “join” method.


From: Marcin Tustin [mailto:mtustin@handybook.com]
Sent: 28 April 2016 12:08
To: Deligiannis, Ioannis (UK)
Cc: dev@spark.apache.org
Subject: Re: RDD.broadcast

Why would you ever need to do this? I'm genuinely curious. I view collects as being solely for interactive work.

On Thursday, April 28, 2016, <Io...@nomura.com>> wrote:
Hi,

It is a common pattern to process an RDD, collect (typically a subset) to the driver and then broadcast back.

Adding an RDD method that can do that using the torrent broadcast mechanics would be much more efficient. In addition, it would not require the Driver to also utilize its Heap holding this broadcast.

I guess this can become complicated if the resulting broadcast is required to keep lineage information, but assuming a torrent distribution, once the broadcast is synced then lineage would not be required. I’d also expect the call to rdd.brodcast to be an action that eagerly distributes the broadcast and returns when the operation has succeeded.

Is this something that could be implemented or are there any reasons that prohibits this?

Thanks
Ioannis

This e-mail (including any attachments) is private and confidential, may contain proprietary or privileged information and is intended for the named recipient(s) only. Unintended recipients are strictly prohibited from taking action on the basis of information in this e-mail and must contact the sender immediately, delete this e-mail (and all attachments) and destroy any hard copies. Nomura will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail. If verification is sought please request a hard copy. Any reference to the terms of executed transactions should be treated as preliminary only and subject to formal written confirmation by Nomura. Nomura reserves the right to retain, monitor and intercept e-mail communications through its networks (subject to and in accordance with applicable laws). No confidentiality or privilege is waived or lost by Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is a reference to any entity in the Nomura Holdings, Inc. group. Please read our Electronic Communications Legal Notice which forms part of this e-mail: http://www.Nomura.com/email_disclaimer.htm<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.Nomura.com_email-5Fdisclaimer.htm&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=SLnOgTBJ2zAlhtvjcFRXfqUArds-4HSAZCgFXLgMCVY&e=>

Want to work at Handy? Check out our culture deck and open roles<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_careers&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=WgDnCrSGv_qt66f2cabjugmMGU46gc5rSkt_gm7lEkQ&e=>
Latest news<https://urldefense.proofpoint.com/v2/url?u=http-3A__www.handy.com_press&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=rfQxr8cDwVFK7Mql1_HdnvqAmXeiOHZgnjNtKXGn_Kg&e=> at Handy
Handy just raised $50m<https://urldefense.proofpoint.com/v2/url?u=http-3A__venturebeat.com_2015_11_02_on-2Ddemand-2Dhome-2Dservice-2Dhandy-2Draises-2D50m-2Din-2Dround-2Dled-2Dby-2Dfidelity_&d=CwMFaQ&c=dCBwIlVXJsYZrY6gpNt0LA&r=B8E4n9FrSS85mPCi6Mfs7cyEPQnVrpcQ1zeB-JKws6A&m=GAA5LZhuKEWXxozKzXPhWAYY4BSTpcXaf2lFg5JSPB0&s=RbQTDcalISb9w2WMxzXmRgR1mr7QiCaqpD2bLAkt-z4&e=> led by Fidelity

[Image removed by sender.]


This e-mail (including any attachments) is private and confidential, may contain proprietary or privileged information and is intended for the named recipient(s) only. Unintended recipients are strictly prohibited from taking action on the basis of information in this e-mail and must contact the sender immediately, delete this e-mail (and all attachments) and destroy any hard copies. Nomura will not accept responsibility or liability for the accuracy or completeness of, or the presence of any virus or disabling code in, this e-mail. If verification is sought please request a hard copy. Any reference to the terms of executed transactions should be treated as preliminary only and subject to formal written confirmation by Nomura. Nomura reserves the right to retain, monitor and intercept e-mail communications through its networks (subject to and in accordance with applicable laws). No confidentiality or privilege is waived or lost by Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is a reference to any entity in the Nomura Holdings, Inc. group. Please read our Electronic Communications Legal Notice which forms part of this e-mail: http://www.Nomura.com/email_disclaimer.htm


Re: RDD.broadcast

Posted by Marcin Tustin <mt...@handybook.com>.
Why would you ever need to do this? I'm genuinely curious. I view collects
as being solely for interactive work.

On Thursday, April 28, 2016, <Io...@nomura.com> wrote:

> Hi,
>
>
>
> It is a common pattern to process an RDD, collect (typically a subset) to
> the driver and then broadcast back.
>
>
>
> Adding an RDD method that can do that using the torrent broadcast
> mechanics would be much more efficient. In addition, it would not require
> the Driver to also utilize its Heap holding this broadcast.
>
>
>
> I guess this can become complicated if the resulting broadcast is required
> to keep lineage information, but assuming a torrent distribution, once the
> broadcast is synced then lineage would not be required. I’d also expect the
> call to rdd.brodcast to be an action that eagerly distributes the broadcast
> and returns when the operation has succeeded.
>
>
>
> Is this something that could be implemented or are there any reasons that
> prohibits this?
>
>
>
> Thanks
>
> Ioannis
>
> This e-mail (including any attachments) is private and confidential, may
> contain proprietary or privileged information and is intended for the named
> recipient(s) only. Unintended recipients are strictly prohibited from
> taking action on the basis of information in this e-mail and must contact
> the sender immediately, delete this e-mail (and all attachments) and
> destroy any hard copies. Nomura will not accept responsibility or liability
> for the accuracy or completeness of, or the presence of any virus or
> disabling code in, this e-mail. If verification is sought please request a
> hard copy. Any reference to the terms of executed transactions should be
> treated as preliminary only and subject to formal written confirmation by
> Nomura. Nomura reserves the right to retain, monitor and intercept e-mail
> communications through its networks (subject to and in accordance with
> applicable laws). No confidentiality or privilege is waived or lost by
> Nomura by any mistransmission of this e-mail. Any reference to "Nomura" is
> a reference to any entity in the Nomura Holdings, Inc. group. Please read
> our Electronic Communications Legal Notice which forms part of this e-mail:
> http://www.Nomura.com/email_disclaimer.htm
>

-- 
Want to work at Handy? Check out our culture deck and open roles 
<http://www.handy.com/careers>
Latest news <http://www.handy.com/press> at Handy
Handy just raised $50m 
<http://venturebeat.com/2015/11/02/on-demand-home-service-handy-raises-50m-in-round-led-by-fidelity/> led 
by Fidelity