Posted to user@crunch.apache.org by Ashish <pa...@gmail.com> on 2012/11/16 14:42:58 UTC

JoinFn queries

Folks,

I have a few questions about JoinFn.

1. Which join functions need one of the PTables in memory? From the
documentation, I gather that MapsideJoin does.

2. I am playing around with JoinFn to merge two datasets. The scenario is
detailed below.

Scenario: Cooked this up to play around with Crunch

One file has the ads returned and a timestamp, in the format
<Ad Id>, <long timestamp>

The other file has just the Ad Ids for which impressions were received
<Ad Id>

The objective is to join the data so that we know which ads got
impressions. The impression table would be 90% (random) the size of the ads
table, so in short, that table cannot fit in memory.

To do the join, I load both files into PTables: (Ad Id, timestamp) for the
ads-returned table, and (Ad Id, Integer) for the impression table.

I then join them using this code:

PTable<String, Pair<Long, Long>> joinedData =
Join.leftJoin(adsReturnedTable, impressionTable);

The result is (Ad Id, timestamp, Is Impressed).

The code works for a small test data set. One problem I am facing is that
for the Ad Ids where no impression is present, the output looks like

a18f1f89-21e1-4fa9-8d24-54702fb9bdeb [1353062206438,]

and for the others it's
f2978128-6e40-4edb-ad3a-5e0ce5e11440 [1353062206479,1]

a. How can I make a 0 (zero) appear when no match is found? From my
exploration, I need to write my own join() and add a check on pair.second()
while emitting. Is there another way to achieve this?

3. How can I hook in a custom output formatter when writing a PTable? For
the above output, I want to get something like

f2978128-6e40-4edb-ad3a-5e0ce5e11440,1353062206479,1

I plan to publish the finished code and all the findings in my 4th blog post
on Crunch.

-- 
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal

Re: JoinFn queries

Posted by Ashish <pa...@gmail.com>.
Thanks Gabriel !

Let me try these tricks out.

Seems like we can have a section for Crunch Recipes or Crunchies :)


On Fri, Nov 16, 2012 at 9:17 PM, Gabriel Reid <ga...@gmail.com> wrote:

> Hi Ashish,
>
> Answers to your questions inlined below.
> [...]


Re: JoinFn queries

Posted by Gabriel Reid <ga...@gmail.com>.
Hi Ashish,

Answers to your questions inlined below.

> 1. Which all join function need one of the PTables in memory? from
> documentation, I could get MapsideJoin has this.

MapsideJoin is indeed the only join implementation that loads a side
of the join into memory. The other (core) join implementations rely
fully on the MapReduce framework to bring linked records together.
Obviously this means that MapsideJoin should only be used if one side
of your join is small enough to fit in memory (although in this case,
you can get much better performance).
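
To make the memory trade-off concrete, here is a minimal plain-Java sketch of the general idea behind an in-memory join: the small side is held in a HashMap and the large side is streamed past it. It is not Crunch's MapsideJoin implementation; the class name InMemoryJoinSketch, the String[] row layout, and the choice of an inner join are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of Crunch.
class InMemoryJoinSketch {

    // Streams the large side row by row and looks each key up in the
    // small side, which must fit in memory as a map. Rows without a
    // match are dropped, so this sketch behaves as an inner join.
    static List<String> join(List<String[]> largeSide, Map<String, String> smallSide) {
        List<String> joined = new ArrayList<>();
        for (String[] row : largeSide) {
            String match = smallSide.get(row[0]); // row[0] is the join key
            if (match != null) {
                joined.add(row[0] + "," + row[1] + "," + match);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> small = new HashMap<>();
        small.put("ad-1", "1");

        List<String[]> large = new ArrayList<>();
        large.add(new String[] {"ad-1", "100"});
        large.add(new String[] {"ad-2", "200"});

        System.out.println(join(large, small));
    }
}
```

The lookup map is the part that has to fit in memory, which is why this style of join only suits cases where one side is small.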

> 2. I am playing around with JoinFn to merge two datasets, scenario is
> detailed below.
>
> Scenario: Cooked this up to play around with Crunch
>
> One file has Ads Returned and time stamp in format
> <Ad Id>, <long timestamp>
>
> Other file has just Ad Ids, for which impressions were received
> <Ad Id>
>
> The objective is to join the data so that we can know which Ads got
> impressions and impression table would be 90%(random) the size of Ads table.
> In short, the table cannot fit in memory.
>
> The way I did the join is, load both of them in PTable. For Ads returned
> table (Ad Id, timestamp) and for Impression Table, its Ad Id and an Integer
>
> And join them using the code
>
> PTable<String, Pair<Long, Long>> joinedData =
> Join.leftJoin(adsReturnedTable, impressionTable);
>
> return is Ad Id, timestamp, Is Impressed
>

The approach that you're taking sounds good, and should scale up
without problems.


> The code is working for small test data set. One problem I am facing is, for
> the Ad Ids, where impression is not present, the output is like
>
> a18f1f89-21e1-4fa9-8d24-54702fb9bdeb [1353062206438,]
>
> for other it's
> f2978128-6e40-4edb-ad3a-5e0ce5e11440 [1353062206479,1]
>
> a. How can I make a 0 (zero) appear when the match is not found. From my
> exploration, I need to write join(), and add check on pair.second() while
> emitting. Is there a another way for achieve this.

The impression value is null in the value pair in this case. You can
replace this with a zero by calling parallelDo on the joined PTable with
your own subclass of MapFn. The MapFn subclass just needs to replace the
Pair containing a null with a Pair containing a 0 as the second value.
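
A minimal sketch of that replacement logic in plain Java, so that it runs without Crunch on the classpath. Map.Entry stands in for Crunch's Pair here, and the names ImpressionDefaults and withZeroDefault are hypothetical. In a real pipeline the same check would sit in the map method of the MapFn subclass passed to parallelDo.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical helper; Map.Entry is a stand-in for Crunch's Pair.
class ImpressionDefaults {

    // Returns a (timestamp, impressed) value in which a null impression
    // flag, as produced by a left join with no match, becomes 0.
    static Map.Entry<Long, Long> withZeroDefault(Map.Entry<Long, Long> value) {
        Long impressed = value.getValue();
        long flag = (impressed == null) ? 0L : impressed;
        return new SimpleImmutableEntry<Long, Long>(value.getKey(), flag);
    }

    public static void main(String[] args) {
        Map.Entry<Long, Long> unmatched =
            new SimpleImmutableEntry<Long, Long>(1353062206438L, null);
        Map.Entry<Long, Long> fixed = withZeroDefault(unmatched);
        System.out.println(fixed.getKey() + "," + fixed.getValue());
    }
}
```

Values that already carry an impression flag pass through unchanged, so the function is safe to apply to every record in the joined table.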

> 3. How can be hook custom output formatter while writing PTable. like for
> the above output, want to get something like
>
> f2978128-6e40-4edb-ad3a-5e0ce5e11440,1353062206479,1

The easiest way to do this is to just implement a MapFn that does the
necessary string formatting in the map method, and then apply it to
the PTable just before you write the output.
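
The string formatting itself could look like the following plain-Java sketch. The names AdLineFormatter and toCsvLine are hypothetical, and the method takes the three fields as separate arguments rather than a Crunch Pair so that it runs on its own. In a pipeline, the map method of the formatting MapFn would call something like it for each record.

```java
// Hypothetical helper showing only the formatting step.
class AdLineFormatter {

    // Builds one comma-separated output line from an ad id, its
    // timestamp, and the impression flag.
    static String toCsvLine(String adId, long timestamp, long impressed) {
        return adId + "," + timestamp + "," + impressed;
    }

    public static void main(String[] args) {
        // prints f2978128-6e40-4edb-ad3a-5e0ce5e11440,1353062206479,1
        System.out.println(
            toCsvLine("f2978128-6e40-4edb-ad3a-5e0ce5e11440", 1353062206479L, 1L));
    }
}
```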

> I plan to publish the finished code and all the finding in 4th blog post on
> crunch.

Cool! Spreading the word about Crunch definitely sounds good.

Hope all this helps, and let me know if anything isn't clear.

Regards,

Gabriel