Posted to common-user@hadoop.apache.org by Chuck Lan <cy...@gmail.com> on 2008/02/22 18:13:31 UTC

Calculations involve large datasets

Hi,

I'm currently looking into how to better scale the performance of our
calculations involving large sets of financial data.  It is currently using
a series of Oracle SQL statements to perform the calculations.  It seems to
me that the MapReduce algorithm may work in this scenario.  However, I
believe I would need to perform some denormalization of data in order for
this to work.  Do I have to?  Or is there a good way to implement joins within
the Hadoop framework efficiently?

Thanks,
Chuck

RE: Calculations involve large datasets

Posted by Runping Qi <ru...@yahoo-inc.com>.
There is a package for joining data from multiple sources: 

contrib/data-join.

It implements the basic joining logic and allows the user to provide
application specific logic for filtering/projecting and combining
multiple records into one.
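
For anyone curious what that looks like in practice, here is a rough sketch of
the subclasses the package expects.  The class and method names below
(DataJoinMapperBase, DataJoinReducerBase, TaggedMapOutput and their hooks) are
recalled from the contrib source and should be checked against the version you
run; the "accounts"/"trades" file names and comma-separated record layout are
purely illustrative assumptions:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.contrib.utils.join.DataJoinMapperBase;
import org.apache.hadoop.contrib.utils.join.DataJoinReducerBase;
import org.apache.hadoop.contrib.utils.join.TaggedMapOutput;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class DataJoinSketch {

  // Carries one record plus the tag identifying which source it came from.
  public static class TaggedRecord extends TaggedMapOutput {
    private Text data = new Text();

    public TaggedRecord() { this.tag = new Text(); }          // needed for Writable deserialization
    public TaggedRecord(Text data) { this(); this.data = data; }

    public Writable getData() { return data; }
    public void write(DataOutput out) throws IOException { tag.write(out); data.write(out); }
    public void readFields(DataInput in) throws IOException { tag.readFields(in); data.readFields(in); }
  }

  public static class JoinMapper extends DataJoinMapperBase {
    // Tag each record by the file it came from (assumed names).
    protected Text generateInputTag(String inputFile) {
      return new Text(inputFile.contains("accounts") ? "accounts" : "trades");
    }

    // Wrap the raw line; inputTag is set by the base class from generateInputTag
    // (check the field name in your version).
    protected TaggedMapOutput generateTaggedMapOutput(Object value) {
      TaggedRecord record = new TaggedRecord((Text) value);
      record.setTag(this.inputTag);
      return record;
    }

    // Join key: assume it is the first comma-separated field of every record.
    protected Text generateGroupKey(TaggedMapOutput aRecord) {
      String line = ((Text) aRecord.getData()).toString();
      return new Text(line.split(",", 2)[0]);
    }
  }

  public static class JoinReducer extends DataJoinReducerBase {
    // Called for each combination of records (one per source) sharing a join
    // key; returning null drops the combination, giving inner-join behaviour
    // when one source has no match.
    protected TaggedMapOutput combine(Object[] tags, Object[] values) {
      if (tags.length < 2) return null;
      StringBuilder joined = new StringBuilder();
      for (int i = 0; i < values.length; i++) {
        String line = ((Text) ((TaggedRecord) values[i]).getData()).toString();
        // Keep the join key once, then append the remaining fields of each record.
        joined.append(i == 0 ? line : "," + line.split(",", 2)[1]);
      }
      TaggedRecord result = new TaggedRecord(new Text(joined.toString()));
      result.setTag((Text) tags[0]);
      return result;
    }
  }
}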

Runping


> -----Original Message-----
> From: Ted Dunning [mailto:tdunning@veoh.com]
> Sent: Friday, February 22, 2008 3:58 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Calculations involve large datasets
> 
> 
> 
> Joins are easy.
> 
> Just reduce on a key composed of the stuff you want to join on.  If the
> data you are joining is disparate, leave some kind of hint about what
> kind of record you have.
>
> The reducer will be iterating through sets of records that have the same
> key.  This is similar to the results of an outer join, except that if you
> are joining A and B and there are multiple records with the join key in
> either A or B, you will see them in the same reduce.  In many such cases,
> MR is actually more efficient than a traditional join because you don't
> necessarily want to generate the cross product of records.
>
> In the reduce, you should build your composite record or do your
> computation on a virtual composite join.
>
> If you are doing a many-to-one join, then you often want the one to
> appear before the many to avoid having to buffer the many until you see
> the one.  This can be done by sorting on your group key plus a source
> key, but grouping on just the group key.
>
>
> You should definitely look at Pig as well since it might fit what I would
> presume to be a fairly SQL-centric culture better than writing large Java
> programs.  Last time I looked (a few months ago), it was definitely not
> ready for us and we have gone in other directions.  The pace of change
> has been prodigious, however, so I expect it is much better than when I
> last looked hard.
> 
> 
> On 2/22/08 10:12 AM, "Tim Wintle" <ti...@teamrubber.com> wrote:
> 
> > Have you seen PIG:
> > http://incubator.apache.org/pig/
> >
> > It generates Hadoop code and is more query-like, and (as far as I
> > remember) includes union, join, etc.
> >
> > Tim
> >
> > On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
> >> Hi,
> >>
> >> I'm currently looking into how to better scale the performance of our
> >> calculations involving large sets of financial data.  It is currently
> >> using a series of Oracle SQL statements to perform the calculations.
> >> It seems to me that the MapReduce algorithm may work in this scenario.
> >> However, I believe I would need to perform some denormalization of
> >> data in order for this to work.  Do I have to?  Or is there a good way
> >> to implement joins within the Hadoop framework efficiently?
> >>
> >> Thanks,
> >> Chuck
> >


RE: Calculations involve large datasets

Posted by Chuck Lan <cl...@modeln.com>.
Thanks for the explanation.  Now I just gotta find some time to do a
POC!

-Chuck

-----Original Message-----
From: Ted Dunning [mailto:tdunning@veoh.com] 
Sent: Friday, February 22, 2008 3:58 PM
To: core-user@hadoop.apache.org
Subject: Re: Calculations involve large datasets



Joins are easy.

Just reduce on a key composed of the stuff you want to join on.  If the
data you are joining is disparate, leave some kind of hint about what kind
of record you have.

The reducer will be iterating through sets of records that have the same
key.  This is similar to the results of an outer join, except that if you
are joining A and B and there are multiple records with the join key in
either A or B, you will see them in the same reduce.  In many such cases,
MR is actually more efficient than a traditional join because you don't
necessarily want to generate the cross product of records.

In the reduce, you should build your composite record or do your
computation on a virtual composite join.

If you are doing a many-to-one join, then you often want the one to appear
before the many to avoid having to buffer the many until you see the one.
This can be done by sorting on your group key plus a source key, but
grouping on just the group key.


You should definitely look at Pig as well since it might fit what I would
presume to be a fairly SQL-centric culture better than writing large Java
programs.  Last time I looked (a few months ago), it was definitely not
ready for us and we have gone in other directions.  The pace of change has
been prodigious, however, so I expect it is much better than when I last
looked hard.


On 2/22/08 10:12 AM, "Tim Wintle" <ti...@teamrubber.com> wrote:

> Have you seen PIG:
> http://incubator.apache.org/pig/
> 
> It generates Hadoop code and is more query-like, and (as far as I
> remember) includes union, join, etc.
> 
> Tim
> 
> On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
>> Hi,
>> 
>> I'm currently looking into how to better scale the performance of our
>> calculations involving large sets of financial data.  It is currently
>> using a series of Oracle SQL statements to perform the calculations.
>> It seems to me that the MapReduce algorithm may work in this scenario.
>> However, I believe I would need to perform some denormalization of data
>> in order for this to work.  Do I have to?  Or is there a good way to
>> implement joins within the Hadoop framework efficiently?
>> 
>> Thanks,
>> Chuck
> 


Re: Calculations involve large datasets

Posted by Ted Dunning <td...@veoh.com>.

Joins are easy.

Just reduce on a key composed of the stuff you want to join on.  If the data
you are joining is disparate, leave some kind of hint about what kind of
record you have.

The reducer will be iterating through sets of records that have the same
key.  This is similar to the results of an outer join, except that if you
are joining A and B and there are multiple records with the join key in
either A or B, you will see them in the same reduce.  In many such cases, MR
is actually more efficient than a traditional join because you don't
necessarily want to generate the cross product of records.

In the reduce, you should build your composite record or do your
computation on a virtual composite join.
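
To make that concrete, here is a minimal sketch of such a tagged reduce-side
join using the org.apache.hadoop.mapred API of the time.  The file names
("accounts", "trades"), the tab-separated layout with the join key in the
first field, and the "A"/"B" tags are illustrative assumptions, not anything
specific to your data:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ReduceSideJoin {

  // Emit (joinKey, "tag<TAB>rest") so the reducer can tell which source each
  // record came from.
  public static class JoinMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private String tag;

    public void configure(JobConf job) {
      // "map.input.file" holds the path of the split being read; use it as the hint.
      String path = job.get("map.input.file", "");
      tag = path.contains("accounts") ? "A" : "B";
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // Assume tab-separated records with the join key in the first field.
      String[] parts = line.toString().split("\t", 2);
      if (parts.length == 2) {
        out.collect(new Text(parts[0]), new Text(tag + "\t" + parts[1]));
      }
    }
  }

  // All records sharing a join key arrive in the same reduce call.  Without
  // the ordering trick described below, arrival order within the call is not
  // guaranteed, so this simple version buffers both sides before combining.
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      List<String> aSide = new ArrayList<String>();
      List<String> bSide = new ArrayList<String>();
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("A\t")) {
          aSide.add(v.substring(2));
        } else {
          bSide.add(v.substring(2));
        }
      }
      // Combine the A-side and B-side records for this key.
      for (String a : aSide) {
        for (String b : bSide) {
          out.collect(key, new Text(a + "\t" + b));
        }
      }
    }
  }
}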

If you are doing a many-to-one join, then you often want the one to appear
before the many to avoid having to buffer the many until you see the one.
This can be done by sorting on your group key plus a source key, but
grouping on just the group key.
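
A rough sketch of that sort/group split, assuming the map output key is
changed to a Text of the form joinKey + "\t" + tag, with the one-side tag
chosen so that it sorts first:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class SecondarySortJoin {

  // Partition on the join key only, so every tag for a key reaches the same reducer.
  public static class JoinKeyPartitioner implements Partitioner<Text, Text> {
    public void configure(JobConf job) {}
    public int getPartition(Text key, Text value, int numPartitions) {
      String joinKey = key.toString().split("\t", 2)[0];
      return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group reduce input on the join key only; the natural Text ordering of
  // "joinKey<TAB>tag" already puts the one-side tag first within each group.
  public static class JoinKeyGroupingComparator extends WritableComparator {
    public JoinKeyGroupingComparator() { super(Text.class, true); }
    public int compare(WritableComparable a, WritableComparable b) {
      String ka = a.toString().split("\t", 2)[0];
      String kb = b.toString().split("\t", 2)[0];
      return ka.compareTo(kb);
    }
  }

  // Wiring (illustrative):
  //   JobConf job = new JobConf(SecondarySortJoin.class);
  //   job.setPartitionerClass(JoinKeyPartitioner.class);
  //   job.setOutputValueGroupingComparator(JoinKeyGroupingComparator.class);
}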


You should definitely look at Pig as well since it might fit what I would
presume to be a fairly SQL-centric culture better than writing large Java
programs.  Last time I looked (a few months ago), it was definitely not
ready for us and we have gone in other directions.  The pace of change has
been prodigious, however, so I expect it is much better than when I last
looked hard.


On 2/22/08 10:12 AM, "Tim Wintle" <ti...@teamrubber.com> wrote:

> Have you seen PIG:
> http://incubator.apache.org/pig/
> 
> It generates Hadoop code and is more query-like, and (as far as I
> remember) includes union, join, etc.
> 
> Tim
> 
> On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
>> Hi,
>> 
>> I'm currently looking into how to better scale the performance of our
>> calculations involving large sets of financial data.  It is currently using
>> a series of Oracle SQL statements to perform the calculations.  It seems to
>> me that the MapReduce algorithm may work in this scenario.  However, I
>> believe I would need to perform some denormalization of data in order for this
>> to work.  Do I have to?  Or is there a good way to implement joins within
>> the Hadoop framework efficiently?
>> 
>> Thanks,
>> Chuck
> 


Re: Calculations involve large datasets

Posted by Tim Wintle <ti...@teamrubber.com>.
Have you seen PIG:
http://incubator.apache.org/pig/

It generates Hadoop code and is more query-like, and (as far as I
remember) includes union, join, etc.

Tim

On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote:
> Hi,
> 
> I'm currently looking into how to better scale the performance of our
> calculations involving large sets of financial data.  It is currently using
> a series of Oracle SQL statements to perform the calculations.  It seems to
> me that the MapReduce algorithm may work in this scenario.  However, I
> believe I would need to perform some denormalization of data in order for this
> to work.  Do I have to?  Or is there a good way to implement joins within
> the Hadoop framework efficiently?
> 
> Thanks,
> Chuck


Re: Calculations involve large datasets

Posted by Amar Kamat <am...@yahoo-inc.com>.
See http://incubator.apache.org/pig/. Hope that helps. Not sure how joins 
could be done in Hadoop.
Amar
On Fri, 22 Feb 2008, Chuck Lan wrote:

> Hi,
>
> I'm currently looking into how to better scale the performance of our
> calculations involving large sets of financial data.  It is currently using
> a series of Oracle SQL statements to perform the calculations.  It seems to
> me that the MapReduce algorithm may work in this scenario.  However, I
> believe I would need to perform some denormalization of data in order for this
> to work.  Do I have to?  Or is there a good way to implement joins within
> the Hadoop framework efficiently?
>
> Thanks,
> Chuck
>