Posted to common-user@hadoop.apache.org by Tom Ferguson <to...@gmail.com> on 2012/04/08 12:37:44 UTC

Structuring MapReduce Jobs

Hello,

I'm very new to Hadoop and I am trying to carry out a proof of concept for
processing some trading data. I am from a .NET background, so I am trying
to prove whether it can be done primarily using C#; therefore I am looking
at the Hadoop Streaming job (from the Hadoop examples) to call into some
C# executables.

My problem is, I am not certain of the best way to structure my jobs to
process the data in the way I want.

I have data stored in an RDBMS in the following format:

ID  TradeID  Date        Value
---------------------------------------------
1   1        2012-01-01  12.34
2   1        2012-01-02  12.56
3   1        2012-01-03  13.78
4   2        2012-01-04  18.94
5   2        2012-05-17  19.32
6   2        2012-05-18  19.63
7   3        2012-05-19  17.32
What I want to do is take all the Dates & Values for a given TradeID into a
mathematical function that will spit out the same set of Dates but with all
the Values recalculated. I hope that makes sense, e.g.

Date        Value
---------------------------
2012-01-01  12.34
2012-01-02  12.56
2012-01-03  13.78

will have the mathematical function applied and spit out

Date        Value
---------------------------
2012-01-01  28.74
2012-01-02  31.29
2012-01-03  29.93

I am not exactly sure how to achieve this using Hadoop Streaming, but my
thoughts so far are...


   1. Use Sqoop to take the data out of the RDBMS and into HDFS, split
   by TradeID - will this guarantee that all the data points for a given
   TradeID will be processed by the same Map task?
   2. Write a Map task as a C# executable that will stream data in, in the
   format (ID, TradeID, Date, Value)
   3. Gather all the data points for a given TradeID together into an array
   (or other data structure)
   4. Pass the array into the mathematical function
   5. Get the results back as another array
   6. Stream the results back out in the format (TradeID, Date,
   ResultValue) - a rough sketch of what I mean is just below this list
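
To make steps 2-6 concrete, this is roughly the kind of C# executable I
have in mind (only a sketch - the tab-separated column order and the
Recalculate placeholder are my own assumptions, and it assumes a single Map
task really does see every data point for a TradeID, which is exactly what
I am not sure about):

using System;
using System.Collections.Generic;

// Rough sketch of steps 2-6: read (ID, TradeID, Date, Value) lines from
// stdin, group them by TradeID in memory, apply the function, then write
// the results back out as (TradeID, Date, ResultValue).
class TradeTask
{
    // Placeholder for the real mathematical function: takes all the
    // (date, value) points of one trade and returns recalculated values.
    static List<Tuple<string, double>> Recalculate(List<Tuple<string, double>> points)
    {
        return points; // identity, just for the sketch
    }

    static void Main()
    {
        var trades = new Dictionary<string, List<Tuple<string, double>>>();

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            // assumed input: ID <tab> TradeID <tab> Date <tab> Value
            var parts = line.Split('\t');
            if (parts.Length < 4) continue;
            string tradeId = parts[1];
            if (!trades.ContainsKey(tradeId))
                trades[tradeId] = new List<Tuple<string, double>>();
            trades[tradeId].Add(Tuple.Create(parts[2], double.Parse(parts[3])));
        }

        foreach (var trade in trades)
            foreach (var r in Recalculate(trade.Value))
                Console.WriteLine("{0}\t{1}\t{2}", trade.Key, r.Item1, r.Item2);
    }
}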

I will have around 500,000 Trade IDs, with up to 3,000 data points each, so
I am hoping that the data/processing will be distributed appropriately by
Hadoop.

Now, this seems a little bit long-winded, but is this the best way of doing
it, based on the constraints of having to use C# for writing my tasks? In
the example above I do not have a Reduce job at all. Is that right in my
scenario?

Thanks for any help you can give and apologies if I am asking stupid
questions here!

Kind Regards,

Tom

Re: Structuring MapReduce Jobs

Posted by Deepak Nettem <de...@gmail.com>.
What a well-structured question!

A Naive Way -

The Mapper will need to emit key,value pairs where TradeID = key, and the
entire record is the value.
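
As a rough sketch (not tested, and assuming tab-separated columns ID,
TradeID, Date, Value), the streaming mapper in C# only has to move TradeID
into the first position - streaming treats everything up to the first tab
as the key by default:

using System;

// Mapper sketch: emit TradeID as the key, Date and Value as the value.
class TradeIdMapper
{
    static void Main()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');   // assumed: ID, TradeID, Date, Value
            if (parts.Length < 4) continue;
            Console.WriteLine("{0}\t{1}\t{2}", parts[1], parts[2], parts[3]);
        }
    }
}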

Hadoop will make sure that all key,value pairs with the same key land up in
the same reducer. In Java for example, all records for the same TradeID
would become available as an Iterable collection.

The Reducer can apply the mathematical function that you're talking about.
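
One thing to note with streaming: the reducer does not get an Iterable per
key the way a Java reducer does - it just reads the sorted lines from
stdin, so it has to notice where one TradeID ends and the next begins. A
sketch (Recalculate is again a placeholder for your function):

using System;
using System.Collections.Generic;

// Reducer sketch: lines arrive sorted by TradeID, so all points for one
// trade are consecutive; buffer them, apply the function, emit, move on.
class TradeReducer
{
    static List<Tuple<string, double>> Recalculate(List<Tuple<string, double>> points)
    {
        return points; // placeholder for the real mathematical function
    }

    static void Flush(string tradeId, List<Tuple<string, double>> points)
    {
        if (tradeId == null) return;
        foreach (var r in Recalculate(points))
            Console.WriteLine("{0}\t{1}\t{2}", tradeId, r.Item1, r.Item2);
    }

    static void Main()
    {
        string currentId = null;
        var points = new List<Tuple<string, double>>();

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');   // TradeID, Date, Value from the mapper
            if (parts.Length < 3) continue;
            if (parts[0] != currentId)
            {
                Flush(currentId, points);   // previous trade is complete
                currentId = parts[0];
                points = new List<Tuple<string, double>>();
            }
            points.Add(Tuple.Create(parts[1], double.Parse(parts[2])));
        }
        Flush(currentId, points);           // emit the last trade
    }
}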

Another Way -

If it is guaranteed that records with the same TradeID occur one after the
other (and occur a fixed number of times, say 'k' times), then you can use
a custom input format that makes available to the mapper 'k' records at a
time, instead of one. The mapper can then apply the mathematical function. No
reducer would be required in this case.
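
Either way, the streaming job wiring would be along these lines - the jar
path varies by Hadoop version, the input/output paths are placeholders, and
running the C# executables through Mono on the task nodes is an assumption
on my part (on a Windows cluster you would invoke the .exe directly); -file
ships the executables out with the job:

# jar location and HDFS paths below are illustrative only
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /trades/in \
    -output /trades/out \
    -mapper "mono TradeIdMapper.exe" \
    -reducer "mono TradeReducer.exe" \
    -file TradeIdMapper.exe \
    -file TradeReducer.exe

For the second (map-only) approach you would drop the -reducer option and
set the number of reduce tasks to zero (-D mapred.reduce.tasks=0, passed
before the other options).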


Deepak Nettem
MS CS
SUNY Stony Brook

Re: Structuring MapReduce Jobs

Posted by Jay Vyas <ja...@gmail.com>.
Hi: Well-phrased question... I think you will need to read up on
reducers, and then you will see the light.

1) In your mapper, emit (date, tradeValue) pairs.

2) Then Hadoop will send the following to the reducers:

date1,tradeValues[]
date2,tradeValues[]
...


3) Then, in your reducer, you will apply the function to the whole set of
trade values.

4) Note that the mapper input is split on files - there are no guarantees
that any particular data will be sent to a particular mapper. If you want
data to be "grouped", you will need to write a mapper that performs this
grouping on an arbitrarily large data set, and then your group-specific
statistics will have to be done at the reducer stage.

Think of it this way: the mapper does the grouping of inputs for reducers,
and the reducers then do the group-specific logic.  For example, in word
count, the mappers emit individual words - the reducers receive a large
group of counts for each individual word, and sum them to emit a total
count.  In your case, the words are like the raw trade records - and the
function you are applying to records from a certain "date" is like the sum
function in the word count reducer.
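
To make the analogy concrete, the word count reducer in streaming is the
same "watch for the key to change" loop - a sketch in C# to match the rest
of this thread, assuming the mapper emits word<TAB>1 lines:

using System;

// Word-count reducer sketch: lines arrive sorted as "word \t 1";
// sum the counts for each word and emit a total when the word changes.
class WordCountReducer
{
    static void Main()
    {
        string current = null;
        long count = 0;

        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            if (parts.Length < 2) continue;
            if (parts[0] != current)
            {
                if (current != null) Console.WriteLine("{0}\t{1}", current, count);
                current = parts[0];
                count = 0;
            }
            count += long.Parse(parts[1]);
        }
        if (current != null) Console.WriteLine("{0}\t{1}", current, count);
    }
}

Your reducer would be exactly this shape, except "sum the counts" becomes
"apply your function to the buffered points for the key".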










-- 
Jay Vyas
MMSB/UCHC

Structuring MapReduce Jobs

Posted by Tom Ferguson <to...@gmail.com>.
Resending my query... it didn't seem to post the first time.

Thanks,

Tom