Posted to hdfs-user@hadoop.apache.org by Sarath <sa...@algofusiontech.com> on 2012/10/04 15:58:48 UTC

Cumulative value using mapreduce

Hi,

I have a file which has some financial transaction data. Each 
transaction will have an amount and a credit/debit indicator.
I want to write a mapreduce program which computes the cumulative credit 
& debit amounts at each record and appends these values to the record 
before dumping it into the output file.

Is this possible? How can I achieve this? Where should I put the logic 
of computing the cumulative values?

Regards,
Sarath.

Re: Cumulative value using mapreduce

Posted by Sarath <sa...@algofusiontech.com>.
Hi Yong,

Could you share more details about the HIVE UDF you have written for 
this use case?
As suggested, I would like to try this approach and see if that 
simplifies the solution to my requirement.

~Sarath.


On Friday 05 October 2012 12:32 AM, java8964 java8964 wrote:
> I did the cumulative sum in a Hive UDF, as one of the projects for my 
> employer.
>
> 1) You need to decide the grouping elements for your cumulative sum. For 
> example, an account, a department, etc. In the mapper, combine this 
> information into your emitted key.
> 2) If you don't have any grouping requirement and you just want a 
> cumulative sum over all your data, then send all the data to one common 
> key, so it will all go to the same reducer.
> 3) When you calculate the cumulative sum, does the output need to have 
> a sorting order? If so, you need a secondary sort, so the data 
> will arrive at the reducer in the order you want.
> 4) In the reducer, just do the sum, emitting one value per original 
> record (not per key).
>
> I suggest you do this in a Hive UDF, as it is much easier, if 
> you can build a Hive schema on top of your data.
>
> Yong
>
> ------------------------------------------------------------------------
> From: tdunning@maprtech.com
> Date: Thu, 4 Oct 2012 18:52:09 +0100
> Subject: Re: Cumulative value using mapreduce
> To: user@hadoop.apache.org
>
> Bertrand is almost right.
>
> The only difference is that the original poster asked about cumulative 
> sum.
>
> This can be done in the reducer exactly as Bertrand described, except for 
> two points that make it different from word count:
>
> a) you can't use a combiner
>
> b) the output of the program is as large as the input so it will have 
> different performance characteristics than aggregation programs like 
> wordcount.
>
> Bertrand's key recommendation to go read a book is the most important 
> advice.
>
> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux
> <dechouxb@gmail.com> wrote:
>
>     Hi,
>
>     It sounds like a
>     1) group information by account
>     2) compute sum per account
>
>     If that's not the case, you should be a bit more precise about
>     your context.
>
>     This computation looks like a small variant of wordcount. If you do
>     not know how to do it, you should read books about Hadoop
>     MapReduce and/or an online tutorial. Yahoo's is old but still a nice
>     read to begin with: http://developer.yahoo.com/hadoop/tutorial/
>
>     Regards,
>
>     Bertrand
>
>
>     On Thu, Oct 4, 2012 at 3:58 PM, Sarath
>     <sarathchandra.josyam@algofusiontech.com> wrote:
>
>         Hi,
>
>         I have a file which has some financial transaction data. Each
>         transaction will have an amount and a credit/debit indicator.
>         I want to write a mapreduce program which computes the
>         cumulative credit & debit amounts at each record and appends
>         these values to the record before dumping it into the output
>         file.
>
>         Is this possible? How can I achieve this? Where should I put
>         the logic of computing the cumulative values?
>
>         Regards,
>         Sarath.
>
>
>
>
>     -- 
>     Bertrand Dechoux
>
>
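
Yong's UDF itself was not posted in the thread; purely as a sketch of the
general shape such a stateful UDF can take (the class name and arguments are
invented, and the result is only meaningful when a single task sees all rows
already in the desired order):

import org.apache.hadoop.hive.ql.exec.UDF;

// Hypothetical cumulative-sum UDF. Hive reuses the UDF instance within a
// task, so the running total survives from row to row.
public final class CumulativeAmount extends UDF {
  private double total = 0.0;

  public Double evaluate(String indicator, Double amount, String wanted) {
    // Accumulate only the rows whose CR/DR indicator matches 'wanted'.
    if (amount != null && wanted.equalsIgnoreCase(indicator)) {
      total += amount;
    }
    return total;
  }
}

Registered with CREATE TEMPORARY FUNCTION cum_amount AS 'CumulativeAmount',
it could then be called as cum_amount(ind, amount, 'CR') and
cum_amount(ind, amount, 'DR') over pre-sorted input.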

Re: Cumulative value using mapreduce

Posted by Bertrand Dechoux <de...@gmail.com>.
Hi,

The provided example records are perfect. With them, I doubt there will be
any confusion about what kind of data is available and how it should be
manipulated. However, "the output is not coming as desired" is vague. It's
hard to say why you are not getting your expected result without a bit more
information about what has been done.

The aim is to compute cumulative credit & debit amounts (like you said)
using a sequence of records that needs to be sorted by date (and by
transaction id, if the order inside the day is relevant and the
transaction id is monotonically increasing). The mapper won't have much
logic and will only be responsible for transforming the records so that
the sort happens as expected. The <key,value> would be something like
<[date,transactionId],[CR/DR,amount]>, and the reducer would apply the
logic of calculating the cumulative sums.

I can see different variations, like:
* What exactly should the reducer input value be: [CR/DR,amount] or only a
signed amount? It doesn't change the logic much, but it could help reduce
the volume of data. Alternatives for serialization and compression should
also be explored.
* Whether several reducers should be used or not. More than one could be
used, but then, in order to have the full cumulative sums, a kind of
post-reduce merge has to be performed: the last results of one file are
CR/DR offsets that should be applied to the results of the next file (see
the sketch after this list). The partitioning will greatly depend on the
processed time range and the associated data volumes.
* What grouping the reducer should use: only one group (with all values
sorted inside this single group), one group per date with internal sorting
per transaction id, or one group per [date,transactionId]. I honestly
don't know the impact each would have without doing benchmarks.
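
A sketch of that post-reduce merge (the output layout "txnId,cr,dr" per
line and one sorted part file per reducer, passed in key order, are
assumptions for illustration, not code from the thread):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class OffsetMerge {
  public static void main(String[] args) throws IOException {
    long crOffset = 0, drOffset = 0;
    for (String part : args) {            // part files, in key order
      long lastCr = 0, lastDr = 0;
      try (BufferedReader in = new BufferedReader(new FileReader(part))) {
        String line;
        while ((line = in.readLine()) != null) {
          String[] f = line.split(",");
          // Per-file cumulative values plus the carry from earlier files.
          lastCr = Long.parseLong(f[1]) + crOffset;
          lastDr = Long.parseLong(f[2]) + drOffset;
          System.out.println(f[0] + "," + lastCr + "," + lastDr);
        }
      }
      crOffset = lastCr;                  // final totals of this file become
      drOffset = lastDr;                  // the offsets for the next one
    }
  }
}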

Yet, all these details might be beside your real problem. So if you
provide more details about your actual computation and results, you might
receive more constructive answers with regard to your problem.

Regards

Bertrand



On Fri, Oct 5, 2012 at 6:56 AM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

>  Thanks for all your responses. As suggested, I will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce job for our product which does filtering and
> transformation of the financial data. This is a new requirement we've got.
> I have also implemented the logic of calculating the cumulative sums. But
> the output is not coming out as desired, and I feel I'm not doing it the
> right way and am missing something. So I thought of getting some quick
> help from the mailing list.
>
> As an example, say we have records as below -
>
>   Txn ID   Txn Date    Cr/Dr Indicator   Amount
>   1001     9/22/2012   CR                  1000
>   1002     9/25/2012   DR                   500
>   1003     10/1/2012   DR                  1500
>   1004     10/4/2012   CR                  2000
>
> When this file is processed, the logic should append the below 2 columns
> to the output for each record above -
>
>   CR Cumulative Amount   DR Cumulative Amount
>   1000                      0
>   1000                    500
>   1000                   2000
>   3000                   2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted's 
> comments on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com> wrote:
>
>>  I did the cumulative sum in a Hive UDF, as one of the projects for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative sum. For
>> example, an account, a department, etc. In the mapper, combine this
>> information into your emitted key.
>> 2) If you don't have any grouping requirement and you just want a
>> cumulative sum over all your data, then send all the data to one common
>> key, so it will all go to the same reducer.
>> 3) When you calculate the cumulative sum, does the output need to have a
>> sorting order? If so, you need a secondary sort, so the data will arrive
>> at the reducer in the order you want.
>> 4) In the reducer, just do the sum, emitting one value per original
>> record (not per key).
>>
>>  I suggest you do this in a Hive UDF, as it is much easier, if
>> you can build a Hive schema on top of your data.
>>
>>  Yong
>>
>>  ------------------------------
>> From: tdunning@maprtech.com
>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>> Subject: Re: Cumulative value using mapreduce
>> To: user@hadoop.apache.org
>>
>>
>> Bertrand is almost right.
>>
>>  The only difference is that the original poster asked about cumulative
>> sum.
>>
>>  This can be done in the reducer exactly as Bertrand described, except for
>> two points that make it different from word count:
>>
>>  a) you can't use a combiner
>>
>>  b) the output of the program is as large as the input so it will have
>> different performance characteristics than aggregation programs like
>> wordcount.
>>
>>  Bertrand's key recommendation to go read a book is the most important
>> advice.
>>
>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:
>>
>> Hi,
>>
>>  It sounds like a
>> 1) group information by account
>> 2) compute sum per account
>>
>>  If that's not the case, you should be a bit more precise about your context.
>>
>>  This computation looks like a small variant of wordcount. If you do not
>> know how to do it, you should read books about Hadoop MapReduce and/or
>> an online tutorial. Yahoo's is old but still a nice read to begin with:
>> http://developer.yahoo.com/hadoop/tutorial/
>>
>>  Regards,
>>
>>  Bertrand
>>
>>
>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have an amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes the cumulative credit &
>> debit amounts at each record and appends these values to the record before
>> dumping it into the output file.
>>
>> Is this possible? How can I achieve this? Where should I put the logic of
>> computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>>
>>
>>
>>   --
>> Bertrand Dechoux
>>
>>
>>
>
>
>  --
> Bertrand Dechoux
>
>


-- 
Bertrand Dechoux

Re: Cumulative value using mapreduce

Posted by Jane Wayne <ja...@gmail.com>.
I'm reading the other posts. I had assumed you had more than one reducer.

If you just have 1 reducer, then no matter what, every key-value pair goes
there. So, in that case, I agree with java8964: you emit all records with
one key to that one reducer. Make sure you apply secondary sorting (which
means you will have to come up with a composite key). When the data comes
into the reducer, just keep a running total and emit each time.
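
A minimal sketch of such a composite key (an assumption for illustration,
not code posted in the thread):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// 'group' carries the single common key everything is sent to; 'txnId'
// exists only so the shuffle sorts records into transaction order.
public class TxnKey implements WritableComparable<TxnKey> {
  public long group;
  public long txnId;

  @Override public void write(DataOutput out) throws IOException {
    out.writeLong(group);
    out.writeLong(txnId);
  }
  @Override public void readFields(DataInput in) throws IOException {
    group = in.readLong();
    txnId = in.readLong();
  }
  @Override public int compareTo(TxnKey o) {
    int c = Long.compare(group, o.group);              // primary: grouping part
    return c != 0 ? c : Long.compare(txnId, o.txnId);  // secondary: txn order
  }
}

Paired with a Partitioner and a grouping comparator that look only at the
group field, the reducer then sees every value in txnId order within a
single reduce call.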

On Fri, Oct 5, 2012 at 11:21 AM, Jane Wayne <ja...@gmail.com> wrote:

> There are probably a million ways to do it, but it seems like it can be
> done, per your question. Off the top of my head, you'd probably want to do
> the cumulative sum in the reducer. If you're savvy, maybe even make the
> reducer reusable as a combiner (it looks like this problem might have an
> associative and commutative reducer).
>
> The difficulty with this problem is that for n input records, you will
> have n output records (looking at your example). Furthermore, each nth
> output record requires information from all of the previous (n-1) records.
> So, if you have 1 billion input records, you may have to move a lot of
> intermediate key-value pairs to your reducer.
>
> Here's a suggestion; please critique it, as perhaps I may learn something.
> Let's take a naive approach. I assume you have this data in a CSV text
> file, that the Tx Ids are sequential, and that you know what the
> start/stop Tx Id is. The mapper/reducer "pseudocode" looks like the
> following.
>
> map(byteOffset, text) {
>   data = parse(text)
>   // replicate this record to its own Tx Id and every later one
>   for i = data.txId to stopTxId
>     emit(i, data)
> }
>
> reduce(txId, datas) {
>   cr = 0
>   dr = 0
>
>   while datas.hasMoreItems
>     data = datas.nextItem  // iterate
>     if "dr" == data.crDrIndicator
>       dr += data.amount
>     else
>       cr += data.amount
>
>   emit(txId, {cr, dr})
> }
>
> What's not desirable about this pseudocode?
> 1. lots of intermediate key-value pairs
> 2. no combiner
> 3. requires knowledge of background information and certain assumptions
> 4. will definitely create "stragglers" (some mappers/reducers will take
> longer to complete than others)
> 5. overflow issues with the cumulative sum?
>
> I thought about the secondary sorting idea, but I'm still not sure how
> that can work. What would you sort on?
>
> One of the things I learned in Programming 101: get the algorithm to work
> first, then optimize later. Hope this helps. Please feel free to critique;
> I would love to learn some more.
>
> On Fri, Oct 5, 2012 at 12:56 AM, Sarath <
> sarathchandra.josyam@algofusiontech.com> wrote:
>
>>  Thanks for all your responses. As suggested, I will go through the
>> documentation once again.
>>
>> But just to clarify, this is not my first map-reduce program. I've
>> already written a map-reduce job for our product which does filtering and
>> transformation of the financial data. This is a new requirement we've got.
>> I have also implemented the logic of calculating the cumulative sums. But
>> the output is not coming out as desired, and I feel I'm not doing it the
>> right way and am missing something. So I thought of getting some quick
>> help from the mailing list.
>>
>> As an example, say we have records as below -
>>
>>   Txn ID   Txn Date    Cr/Dr Indicator   Amount
>>   1001     9/22/2012   CR                  1000
>>   1002     9/25/2012   DR                   500
>>   1003     10/1/2012   DR                  1500
>>   1004     10/4/2012   CR                  2000
>>
>> When this file is processed, the logic should append the below 2 columns
>> to the output for each record above -
>>
>>   CR Cumulative Amount   DR Cumulative Amount
>>   1000                      0
>>   1000                    500
>>   1000                   2000
>>   3000                   2000
>>
>> Hope the problem is clear now. Please provide your suggestions on the
>> approach to the solution.
>>
>> Regards,
>> Sarath.
>>
>>
>> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>>
>> I indeed didn't catch the cumulative sum part. Then I guess it begs for
>> what-is-often-called-a-secondary-sort, if you want to compute different
>> cumulative sums during the same job. It can be more or less easy to
>> implement depending on which API/library/tool you are using. Ted's
>> comments on performance are spot on.
>>
>>  Regards
>>
>>  Bertrand
>>
>> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com> wrote:
>>
>>>  I did the cumulative sum in a Hive UDF, as one of the projects for my
>>> employer.
>>>
>>>  1) You need to decide the grouping elements for your cumulative sum.
>>> For example, an account, a department, etc. In the mapper, combine this
>>> information into your emitted key.
>>> 2) If you don't have any grouping requirement and you just want a
>>> cumulative sum over all your data, then send all the data to one common
>>> key, so it will all go to the same reducer.
>>> 3) When you calculate the cumulative sum, does the output need to have a
>>> sorting order? If so, you need a secondary sort, so the data will arrive
>>> at the reducer in the order you want.
>>> 4) In the reducer, just do the sum, emitting one value per original
>>> record (not per key).
>>>
>>>  I suggest you do this in a Hive UDF, as it is much easier, if
>>> you can build a Hive schema on top of your data.
>>>
>>>  Yong
>>>
>>>  ------------------------------
>>> From: tdunning@maprtech.com
>>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>>> Subject: Re: Cumulative value using mapreduce
>>> To: user@hadoop.apache.org
>>>
>>>
>>> Bertrand is almost right.
>>>
>>>  The only difference is that the original poster asked about cumulative
>>> sum.
>>>
>>>  This can be done in the reducer exactly as Bertrand described, except for
>>> two points that make it different from word count:
>>>
>>>  a) you can't use a combiner
>>>
>>>  b) the output of the program is as large as the input so it will have
>>> different performance characteristics than aggregation programs like
>>> wordcount.
>>>
>>>  Bertrand's key recommendation to go read a book is the most important
>>> advice.
>>>
>>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>>  It sounds like a
>>> 1) group information by account
>>> 2) compute sum per account
>>>
>>>  If that's not the case, you should be a bit more precise about your context.
>>>
>>>  This computation looks like a small variant of wordcount. If you do not
>>> know how to do it, you should read books about Hadoop MapReduce and/or
>>> an online tutorial. Yahoo's is old but still a nice read to begin with:
>>> http://developer.yahoo.com/hadoop/tutorial/
>>>
>>>  Regards,
>>>
>>>  Bertrand
>>>
>>>
>>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>>> sarathchandra.josyam@algofusiontech.com> wrote:
>>>
>>> Hi,
>>>
>>> I have a file which has some financial transaction data. Each
>>> transaction will have an amount and a credit/debit indicator.
>>> I want to write a mapreduce program which computes the cumulative credit
>>> & debit amounts at each record and appends these values to the record
>>> before dumping it into the output file.
>>>
>>> Is this possible? How can I achieve this? Where should I put the logic
>>> of computing the cumulative values?
>>>
>>> Regards,
>>> Sarath.
>>>
>>>
>>>
>>>
>>>   --
>>> Bertrand Dechoux
>>>
>>>
>>>
>>
>>
>>  --
>> Bertrand Dechoux
>>
>>
>

Re: Cumulative value using mapreduce

Posted by Jane Wayne <ja...@gmail.com>.
There are probably a million ways to do it, but it seems like it can be
done, per your question. Off the top of my head, you'd probably want to do
the cumulative sum in the reducer. If you're savvy, maybe even make the
reducer reusable as a combiner (it looks like this problem might have an
associative and commutative reducer).

The difficulty with this problem is that for n input records, you will have
n output records (looking at your example). Furthermore, each nth output
record requires information from all of the previous (n-1) records. So, if
you have 1 billion input records, you may have to move a lot of
intermediate key-value pairs to your reducer.

Here's a suggestion; please critique it, as perhaps I may learn something.
Let's take a naive approach. I assume you have this data in a CSV text
file, that the Tx Ids are sequential, and that you know what the start/stop
Tx Id is. The mapper/reducer "pseudocode" looks like the following.

map(byteOffset, text) {
  data = parse(text)
  // replicate this record to its own Tx Id and every later one
  for i = data.txId to stopTxId
    emit(i, data)
}

reduce(txId, datas) {
  cr = 0
  dr = 0

  while datas.hasMoreItems
    data = datas.nextItem  // iterate
    if "dr" == data.crDrIndicator
      dr += data.amount
    else
      cr += data.amount

  emit(txId, {cr, dr})
}

What's not desirable about this pseudocode?
1. lots of intermediate key-value pairs
2. no combiner
3. requires knowledge of background information and certain assumptions
4. will definitely create "stragglers" (some mappers/reducers will take
longer to complete than others)
5. overflow issues with the cumulative sum?

I thought about the secondary sorting idea, but I'm still not sure how that
can work. What would you sort on?

One of the things I learned in Programming 101: get the algorithm to work
first, then optimize later. Hope this helps. Please feel free to critique;
I would love to learn some more.
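
A hedged, runnable translation of the pseudocode above into the Hadoop
Java API (the stop.tx.id configuration key and the CSV layout
"txnId,date,indicator,amount" are assumptions for illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NaiveCumulative {
  public static class BroadcastMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      long stopTxId = ctx.getConfiguration().getLong("stop.tx.id", 0);
      String[] f = line.toString().split(",");
      Text value = new Text(f[2] + "," + f[3]);
      // Replicate this record to its own txn id and every later one.
      for (long i = Long.parseLong(f[0]); i <= stopTxId; i++) {
        ctx.write(new LongWritable(i), value);
      }
    }
  }

  public static class SumReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable txId, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      long cr = 0, dr = 0;
      // Every record with id <= txId shows up here; sum by indicator.
      for (Text v : values) {
        String[] f = v.toString().split(",");
        if ("DR".equalsIgnoreCase(f[0])) dr += Long.parseLong(f[1]);
        else                             cr += Long.parseLong(f[1]);
      }
      ctx.write(txId, new Text(cr + "," + dr));
    }
  }
}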

On Fri, Oct 5, 2012 at 12:56 AM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

>  Thanks for all your responses. As suggested, I will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce job for our product which does filtering and
> transformation of the financial data. This is a new requirement we've got.
> I have also implemented the logic of calculating the cumulative sums. But
> the output is not coming out as desired, and I feel I'm not doing it the
> right way and am missing something. So I thought of getting some quick
> help from the mailing list.
>
> As an example, say we have records as below -
>
>   Txn ID   Txn Date    Cr/Dr Indicator   Amount
>   1001     9/22/2012   CR                  1000
>   1002     9/25/2012   DR                   500
>   1003     10/1/2012   DR                  1500
>   1004     10/4/2012   CR                  2000
>
> When this file is processed, the logic should append the below 2 columns
> to the output for each record above -
>
>   CR Cumulative Amount   DR Cumulative Amount
>   1000                      0
>   1000                    500
>   1000                   2000
>   3000                   2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted's
> comments on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com> wrote:
>
>>  I did the cumulative sum in a Hive UDF, as one of the projects for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative sum. For
>> example, an account, a department, etc. In the mapper, combine this
>> information into your emitted key.
>> 2) If you don't have any grouping requirement and you just want a
>> cumulative sum over all your data, then send all the data to one common
>> key, so it will all go to the same reducer.
>> 3) When you calculate the cumulative sum, does the output need to have a
>> sorting order? If so, you need a secondary sort, so the data will arrive
>> at the reducer in the order you want.
>> 4) In the reducer, just do the sum, emitting one value per original
>> record (not per key).
>>
>>  I suggest you do this in a Hive UDF, as it is much easier, if
>> you can build a Hive schema on top of your data.
>>
>>  Yong
>>
>>  ------------------------------
>> From: tdunning@maprtech.com
>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>> Subject: Re: Cumulative value using mapreduce
>> To: user@hadoop.apache.org
>>
>>
>> Bertrand is almost right.
>>
>>  The only difference is that the original poster asked about cumulative
>> sum.
>>
>>  This can be done in the reducer exactly as Bertrand described, except for
>> two points that make it different from word count:
>>
>>  a) you can't use a combiner
>>
>>  b) the output of the program is as large as the input so it will have
>> different performance characteristics than aggregation programs like
>> wordcount.
>>
>>  Bertrand's key recommendation to go read a book is the most important
>> advice.
>>
>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:
>>
>> Hi,
>>
>>  It sounds like a
>> 1) group information by account
>> 2) compute sum per account
>>
>>  If that's not the case, you should be a bit more precise about your context.
>>
>>  This computation looks like a small variant of wordcount. If you do not
>> know how to do it, you should read books about Hadoop MapReduce and/or
>> an online tutorial. Yahoo's is old but still a nice read to begin with:
>> http://developer.yahoo.com/hadoop/tutorial/
>>
>>  Regards,
>>
>>  Bertrand
>>
>>
>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have an amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes the cumulative credit &
>> debit amounts at each record and appends these values to the record before
>> dumping it into the output file.
>>
>> Is this possible? How can I achieve this? Where should I put the logic
>> of computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>>
>>
>>
>>   --
>> Bertrand Dechoux
>>
>>
>>
>
>
>  --
> Bertrand Dechoux
>
>

RE: Cumulative value using mapreduce

Posted by java8964 java8964 <ja...@hotmail.com>.
Are you allowed to change the order of the data in the output? If you want
to calculate the cr/dr indicator cumulative sum value, then it will be easy
if the business allows you to change the order of your data, grouped by the
CR/DR indicator, in the output.
For example, you can do it very easily with the way I described in my
original email if you CAN change the output like the following:
Txn ID    Cr/Dr Indicator    Amount    CR Cumulative Amount    DR Cumulative Amount
1001      CR                 1000      1000                    0
1004      CR                 2000      3000                    0
1002      DR                 500       0                       500
1003      DR                 1500      0                       2000
As you can see, you have to group your output by the Cr/Dr Indicator. If
you want to keep the original order, then it is hard; at least I cannot
think of a way in a short time.
But if you are allowed to change the order of the output, then it is called
a cumulative sum with grouping (in this case, group 1 for CR, group 2 for
DR).
1) In the mapper, emit your data keyed by the Cr/Dr indicator, which will
group the data by CR/DR. So all CR data will go to one reducer, and all DR
data will go to one reducer.
2) Besides grouping the data, if you want the output sorted by the Amount
(for example) within each group, then you have to do the 2nd sorting
(Google "secondary sort"). Then for each group, the data arriving at each
reducer will be sorted by amount. Otherwise, if you don't need that
sorting, just ignore the 2nd sorting.
3) In each reducer, the data arriving should already be grouped. The
default partitioner for an MR job is the HashPartitioner. Depending on the
hashCode() returned for 'CR' and 'DR', these 2 groups could go to different
reducers (assuming you are running with multiple reducers), or they could
go to the same reducer. But even if they go to the same reducer, they will
arrive as 2 groups. So the output of your reducers will be grouped, and
sorted as well.
4) In your reducers, for each group you will get an array of values. For
CR, you will get all the CR records in the array. What you need to do is
iterate over the array and, for every element, calculate the cumulative sum
and emit it with each record.
5) In the end, your output could be multiple files, as each file is
generated by one reducer. You can merge them into one file, or just leave
them as they are in HDFS.
6) For best performance, if you have huge data AND you know all the
possible values for the indicator, you may want to consider using your own
custom Partitioner instead of the HashPartitioner (see the sketch below).
What you want is something like a round-robin distribution of your keys
across the available reducers, instead of a random distribution by hash
value. Keep in mind that a hash-based distribution does NOT work well when
the distinct count of your keys is small.
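
As a sketch of point 6, assuming the mapper emits the indicator itself as a
Text key (the class name and fixed slot numbers are made up for the
example):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: pins each known indicator value to a fixed reducer
// slot instead of relying on hashCode(), which distributes poorly when
// there are only a couple of distinct keys.
public class IndicatorPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    int slot = "CR".equals(key.toString()) ? 0 : 1;
    return slot % numPartitions; // still works with a single reducer
  }
}

It would be wired in with job.setPartitionerClass(IndicatorPartitioner.class).
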
Yong


Date: Fri, 5 Oct 2012 10:26:43 +0530
From: sarathchandra.josyam@algofusiontech.com
To: user@hadoop.apache.org
Subject: Re: Cumulative value using mapreduce

Thanks for all your responses. As suggested will go through the
documentation once again.

But just to clarify, this is not my first map-reduce program. I've already
written a map-reduce for our product which does filtering and
transformation of the financial data. This is a new requirement we've got.
I have also did the logic of calculating the cumulative sums. But the
output is not coming as desired and I feel I'm not doing it right way and
missing something. So thought of taking a quick help from the mailing list.

As an example, say we have records as below -

  Txn ID    Txn Date     Cr/Dr Indicator    Amount
  1001      9/22/2012    CR                 1000
  1002      9/25/2012    DR                 500
  1003      10/1/2012    DR                 1500
  1004      10/4/2012    CR                 2000

When this file passed the logic should append the below 2 columns to the
output for each record above -

  CR Cumulative Amount    DR Cumulative Amount
  1000                    0
  1000                    500
  1000                    2000
  3000                    2000

Hope the problem is clear now. Please provide your suggestions on the
approach to the solution.

Regards,
Sarath.

Re: Cumulative value using mapreduce

Posted by Jane Wayne <ja...@gmail.com>.
there are probably a million ways to do it, but it seems like it can be
done, per your question. off the top of my head, you'd probably want to do
the cumulative sum in the reducer. if you're savvy, maybe even make the
reducer reusable as a combiner (looks like this problem might have an
associative and commutative reducer).

the difficulty with this problem is that for n input records, you will have
n output records (looking at your example). furthermore, each n-th output
record requires information from all the previous (n-1) records. so, if you
have 1 billion input records, it's looking like you may have to move a lot
of intermediary key-value pairs to your reducer.

here's a suggestion and please critique, perhaps i may learn something.
let's take a naive approach. i assume you have this data in a text file
with CSV. i assume the Tx Ids are sequential, and you know what the
start/stop Tx Id is. the mapper/reducer "pseudocode" looks like the
following.

map(byteOffset, text) {
 data = parse(text)
 for i=data.txId to stopTxId
  emit(i, data)
}

reduce(txId, datas) {
 cr = 0
 dr = 0

 while datas.hasMoreItems
  data = datas.nextItem //iterate
  if "dr" == data.crDrIndicator
   dr += data.amount
  else
   cr += data.amount

 emit(txId, {cr, dr})
}

what's not desirable about this pseudocode?
1. lots of intermediary key-value pairs
2. no combiner
3. requires knowledge of background information and certain assumptions
4. will definitely create "stragglers" (some mappers/reducers will take
longer to complete than others)
5. overflow issues with the cumulative sum?

i thought about the secondary sorting idea, but i'm still not sure how that
can work. what would you sort on?

one of the things i learned in programming 101, get the algorithm to work
first, then optimize later. hope this helps. please feel free to critique.
would love to learn some more.
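
to make the sketch concrete, here is a rough (untested) java translation of
the pseudocode, assuming comma-separated input of (txId, date, cr/dr,
amount) and a made-up "stop.tx.id" configuration entry for the stop Tx Id:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// rough sketch of the naive fan-out approach described above
public class NaiveCumulative {

  public static class FanOutMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable off, Text line, Context ctx)
        throws IOException, InterruptedException {
      long stopTxId = ctx.getConfiguration().getLong("stop.tx.id", 0);
      String[] f = line.toString().split(","); // txId, date, cr/dr, amount
      long txId = Long.parseLong(f[0]);
      for (long i = txId; i <= stopTxId; i++)  // fan out to all later txIds
        ctx.write(new LongWritable(i), new Text(f[2] + "," + f[3]));
    }
  }

  public static class SumReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable txId, Iterable<Text> vals, Context ctx)
        throws IOException, InterruptedException {
      long cr = 0, dr = 0;
      for (Text v : vals) {
        String[] f = v.toString().split(",");
        if ("DR".equalsIgnoreCase(f[0])) dr += Long.parseLong(f[1]);
        else cr += Long.parseLong(f[1]);
      }
      ctx.write(txId, new Text(cr + "," + dr)); // cumulative cr and dr
    }
  }
}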

On Fri, Oct 5, 2012 at 12:56 AM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

>  Thanks for all your responses. As suggested will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce for our product which does filtering and
> transformation of the financial data. This is a new requirement we've got.
> I have also did the logic of calculating the cumulative sums. But the
> output is not coming as desired and I feel I'm not doing it right way and
> missing something. So thought of taking a quick help from the mailing list.
>
> As an example, say we have records as below -
>    Txn ID    Txn Date     Cr/Dr Indicator    Amount
>    1001      9/22/2012    CR                 1000
>    1002      9/25/2012    DR                 500
>    1003      10/1/2012    DR                 1500
>    1004      10/4/2012    CR                 2000
>
> When this file passed the logic should append the below 2 columns to the
> output for each record above -
>
>    CR Cumulative Amount    DR Cumulative Amount
>    1000                    0
>    1000                    500
>    1000                    2000
>    3000                    2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted comments
> on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com>wrote:
>
>>  I did the cumulative sum in the HIVE UDF, as one of the project for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative. For
>> example, an account, a department etc. In the mapper, combine these
>> information as your omit key.
>> 2) If you don't have any grouping requirement, you just want a cumulative
>> sum for all your data, then send all the data to one common key, so they
>> will all go to the same reducer.
>> 3) When you calculate the cumulative sum, does the output need to have a
>> sorting order? If so, you need to do the 2nd sorting, so the data will be
>> sorted as the order you want in the reducer.
>> 4) In the reducer, just do the sum, omit every value per original record
>> (Not per key).
>>
>>  I will suggest you do this in the UDF of HIVE, as it is much easy, if
>> you can build a HIVE schema on top of your data.
>>
>>  Yong
>>
>>  ------------------------------
>> From: tdunning@maprtech.com
>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>> Subject: Re: Cumulative value using mapreduce
>> To: user@hadoop.apache.org
>>
>>
>> Bertrand is almost right.
>>
>>  The only difference is that the original poster asked about cumulative
>> sum.
>>
>>  This can be done in reducer exactly as Bertrand described except for
>> two points that make it different from word count:
>>
>>  a) you can't use a combiner
>>
>>  b) the output of the program is as large as the input so it will have
>> different performance characteristics than aggregation programs like
>> wordcount.
>>
>>  Bertrand's key recommendation to go read a book is the most important
>> advice.
>>
>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>
>> Hi,
>>
>>  It sounds like a
>> 1) group information by account
>> 2) compute sum per account
>>
>>  If that not the case, you should precise a bit more about your context.
>>
>>  This computing looks like a small variant of wordcount. If you do not
>> know how to do it, you should read books about Hadoop MapReduce and/or
>> online tutorial. Yahoo's is old but still a nice read to begin with :
>> http://developer.yahoo.com/hadoop/tutorial/
>>
>>  Regards,
>>
>>  Bertrand
>>
>>
>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes cumulative credit &
>> debit amounts at each record
>> and append these values to the record before dumping into the output file.
>>
>> Is this possible? How can I achieve this? Where should i put the logic of
>> computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>>
>>
>>
>>   --
>> Bertrand Dechoux
>>
>>
>>
>
>
>  --
> Bertrand Dechoux
>
>

Re: Cumulative value using mapreduce

Posted by Steve Loughran <st...@hortonworks.com>.
On 5 October 2012 06:50, Ted Dunning <td...@maprtech.com> wrote:

> negative numbers are a relatively new concept in accounting


since 2008, if I'm not mistaken


Re: Cumulative value using mapreduce

Posted by Ted Dunning <td...@maprtech.com>.
The answer is really the same.  Your problem is just using a goofy
representation for negative numbers (after all, negative numbers are a
relatively new concept in accounting).

You still need to use the account number as the key and the date as a sort
key.  Many financial institutions also process all debits before credits on
a particular day in order to maximize overdraft fees, so you may want to
use the CR/DR field as a secondary key in the sort.

Then the addition is field driven.  Add to one sum or the other and always
add both sums to the record.
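
A sketch of how that key structure could be wired up with the mapreduce
API; every class referenced here is a placeholder you would have to write
yourself (partition on account only, sort on account + date + indicator,
group on account):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver wiring for the secondary sort described above.
public class CumulativeDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "cumulative credit/debit");
    job.setJarByClass(CumulativeDriver.class);
    job.setMapperClass(TxnMapper.class);          // key = account + date + CR/DR
    job.setReducerClass(CumulativeReducer.class); // keeps both running sums
    job.setPartitionerClass(AccountPartitioner.class);       // account only
    job.setSortComparatorClass(TxnSortComparator.class);     // date, DR before CR
    job.setGroupingComparatorClass(AccountGroupingComparator.class); // 1 group/account
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}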

On Fri, Oct 5, 2012 at 5:56 AM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

>  Thanks for all your responses. As suggested will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce for our product which does filtering and
> transformation of the financial data. This is a new requirement we've got.
> I have also did the logic of calculating the cumulative sums. But the
> output is not coming as desired and I feel I'm not doing it right way and
> missing something. So thought of taking a quick help from the mailing list.
>
> As an example, say we have records as below -
>    Txn ID    Txn Date     Cr/Dr Indicator    Amount
>    1001      9/22/2012    CR                 1000
>    1002      9/25/2012    DR                 500
>    1003      10/1/2012    DR                 1500
>    1004      10/4/2012    CR                 2000
>
> When this file passed the logic should append the below 2 columns to the
> output for each record above -
>
>    CR Cumulative Amount    DR Cumulative Amount
>    1000                    0
>    1000                    500
>    1000                    2000
>    3000                    2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted comments
> on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com>wrote:
>
>>  I did the cumulative sum in the HIVE UDF, as one of the project for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative. For
>> example, an account, a department etc. In the mapper, combine these
>> information as your omit key.
>> 2) If you don't have any grouping requirement, you just want a cumulative
>> sum for all your data, then send all the data to one common key, so they
>> will all go to the same reducer.
>> 3) When you calculate the cumulative sum, does the output need to have a
>> sorting order? If so, you need to do the 2nd sorting, so the data will be
>> sorted as the order you want in the reducer.
>> 4) In the reducer, just do the sum, omit every value per original record
>> (Not per key).
>>
>>  I will suggest you do this in the UDF of HIVE, as it is much easy, if
>> you can build a HIVE schema on top of your data.
>>
>>  Yong
>>
>>  ------------------------------
>> From: tdunning@maprtech.com
>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>> Subject: Re: Cumulative value using mapreduce
>> To: user@hadoop.apache.org
>>
>>
>> Bertrand is almost right.
>>
>>  The only difference is that the original poster asked about cumulative
>> sum.
>>
>>  This can be done in reducer exactly as Bertrand described except for
>> two points that make it different from word count:
>>
>>  a) you can't use a combiner
>>
>>  b) the output of the program is as large as the input so it will have
>> different performance characteristics than aggregation programs like
>> wordcount.
>>
>>  Bertrand's key recommendation to go read a book is the most important
>> advice.
>>
>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com>wrote:
>>
>> Hi,
>>
>>  It sounds like a
>> 1) group information by account
>> 2) compute sum per account
>>
>>  If that not the case, you should precise a bit more about your context.
>>
>>  This computing looks like a small variant of wordcount. If you do not
>> know how to do it, you should read books about Hadoop MapReduce and/or
>> online tutorial. Yahoo's is old but still a nice read to begin with :
>> http://developer.yahoo.com/hadoop/tutorial/
>>
>>  Regards,
>>
>>  Bertrand
>>
>>
>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes cumulative credit &
>> debit amounts at each record
>> and append these values to the record before dumping into the output file.
>>
>> Is this possible? How can I achieve this? Where should i put the logic of
>> computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>>
>>
>>
>>   --
>> Bertrand Dechoux
>>
>>
>>
>
>
>  --
> Bertrand Dechoux
>
>

Re: Cumulative value using mapreduce

Posted by Bertrand Dechoux <de...@gmail.com>.
Hi,

The provided example records are perfect. With that I doubt there will be
any confusion about what kind of data is available and it should be
manipulated. However, "the output is not coming as desired" is vague. It's
hard to say why you are not getting your expected result without a bit more
information about what has been done.

The aim is to compute cumulative credit & debit amounts (like you said)
using a sequence of records that need be sorted by date (and transaction id
if the order inside the day is relevant and if the transaction id is
monotonically
increasing.) The mapper won't have much logic and will be only responsible
for transforming the records so that the sort happens as expect. The
<key,value> would be something like <[date,transactionId],[CR/DR,amount]>.
And the reducer would apply the logic of calculating the cumulative sums.

I can see different variations. Like
* what exactly should be the reducer input value : [CR/DR,amount] or only a
signed amount. It doesn't change the logic much but it could help reducing
the volume of data. Alternatives for serialization and compression should
also be explored.
* whether several reducers should be used or not. More than one could be
used but then in order to have the full cumulative sums, a kind of
post-reduce merge should be performed. The last results of a file will be
CR/DR offsets that should be applied to the results of the next file. The
partitioning will greatly depends on the processed time range and the
associated data volumes.
* what group should be used by the reducer : only one group (with all
values sorted inside this single group) or one group per date with internal
sorting per transaction id or one group per [date,transactionId]. I
honestly don't know the impact that each would have without doing
benchmarks.

Yet, all these details might be way of your real problems. So if you
provide more details about your actual computation and results, you might
receive more constructive answers with regard to your problem.

Regards

Bertrand



On Fri, Oct 5, 2012 at 6:56 AM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

>  Thanks for all your responses. As suggested will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce for our product which does filtering and
> transformation of the financial data. This is a new requirement we've got.
> I have also did the logic of calculating the cumulative sums. But the
> output is not coming as desired and I feel I'm not doing it right way and
> missing something. So thought of taking a quick help from the mailing list.
>
> As an example, say we have records as below -
>   Txn ID
>  Txn Date
>  Cr/Dr Indicator
>  Amount
>   1001
>  9/22/2012
>  CR
>  1000
>   1002
>  9/25/2012
>  DR
>  500
>   1003
>  10/1/2012
>  DR
>  1500
>   1004
>  10/4/2012
>  CR
>  2000
>
> When this file passed the logic should append the below 2 columns to the
> output for each record above -
>   CR Cumulative Amount
>  DR Cumulative Amount
>   1000
>  0
>   1000
>  500
>   1000
>  2000
>   3000
>  2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted comments
> on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com> wrote:
>
>>  I did the cumulative sum in the HIVE UDF, as one of the project for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative sum. For
>> example, an account, a department, etc. In the mapper, combine this
>> information as your emit key.
>> 2) If you don't have any grouping requirement, and you just want a
>> cumulative sum for all your data, then send all the data to one common
>> key, so it will all go to the same reducer.
>> 3) When you calculate the cumulative sum, does the output need to have a
>> sorting order? If so, you need to do a secondary sort, so the data will
>> be sorted in the order you want in the reducer.
>> 4) In the reducer, just do the sum, and emit every value per original
>> record (not per key).
>>
>>  I will suggest you do this in a HIVE UDF, as it is much easier, if you
>> can build a HIVE schema on top of your data.
>>
>>  Yong
>>
>>  ------------------------------
>> From: tdunning@maprtech.com
>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>> Subject: Re: Cumulative value using mapreduce
>> To: user@hadoop.apache.org
>>
>>
>> Bertrand is almost right.
>>
>>  The only difference is that the original poster asked about cumulative
>> sum.
>>
>>  This can be done in reducer exactly as Bertrand described except for
>> two points that make it different from word count:
>>
>>  a) you can't use a combiner
>>
>>  b) the output of the program is as large as the input so it will have
>> different performance characteristics than aggregation programs like
>> wordcount.
>>
>>  Bertrand's key recommendation to go read a book is the most important
>> advice.
>>
>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:
>>
>> Hi,
>>
>>  It sounds like a
>> 1) group information by account
>> 2) compute sum per account
>>
>>  If that's not the case, you should be a bit more precise about your
>> context.
>>
>>  This computation looks like a small variant of wordcount. If you do not
>> know how to do it, you should read books about Hadoop MapReduce and/or
>> online tutorial. Yahoo's is old but still a nice read to begin with :
>> http://developer.yahoo.com/hadoop/tutorial/
>>
>>  Regards,
>>
>>  Bertrand
>>
>>
>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes cumulative credit &
>> debit amounts at each record
>> and append these values to the record before dumping into the output file.
>>
>> Is this possible? How can I achieve this? Where should I put the logic of
>> computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>>
>>
>>
>>   --
>> Bertrand Dechoux
>>
>>
>>
>
>
>  --
> Bertrand Dechoux
>
>


-- 
Bertrand Dechoux

Re: Cumulative value using mapreduce

Posted by Ted Dunning <td...@maprtech.com>.
The answer is really the same.  Your problem is just using a goofy
representation for negative numbers (after all, negative numbers are a
relatively new concept in accounting).

You still need to use the account number as the key and the date as a sort
key.  Many financial institutions also process all debits before credits on
a particular day in order to maximize overdraft fees, so you may want to use
the CR/DR field as a secondary key in the sort.

Then the addition is field-driven.  Add to one sum or the other, and always
add both sums to the record.
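
A minimal sketch of that composite sort key (the yyyyMMdd normalization and
the flag encoding are illustrative assumptions, not part of the original
suggestion):

// Build the sort key: account, then date, then a flag that makes debits
// sort before credits within a day.
public final class TxnSortKey {
  static String of(String account, String yyyyMMdd, String crDr) {
    String flag = "DR".equals(crDr) ? "0" : "1"; // debits first
    return account + "|" + yyyyMMdd + "|" + flag;
  }

  public static void main(String[] args) {
    System.out.println(of("ACC1", "20121004", "DR")); // ACC1|20121004|0
    System.out.println(of("ACC1", "20121004", "CR")); // ACC1|20121004|1
  }
}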

On Fri, Oct 5, 2012 at 5:56 AM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

>  Thanks for all your responses. As suggested, will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce for our product which does filtering and
> transformation of the financial data. This is a new requirement we've got.
> I have also implemented the logic of calculating the cumulative sums. But
> the output is not coming as desired and I feel I'm not doing it the right
> way and missing something. So thought of taking quick help from the
> mailing list.
>
> As an example, say we have records as below -
>
>   Txn ID   Txn Date    Cr/Dr Indicator   Amount
>   1001     9/22/2012   CR                1000
>   1002     9/25/2012   DR                500
>   1003     10/1/2012   DR                1500
>   1004     10/4/2012   CR                2000
>
> When this file is passed, the logic should append the below 2 columns to
> the output for each record above -
>
>   CR Cumulative Amount   DR Cumulative Amount
>   1000                   0
>   1000                   500
>   1000                   2000
>   3000                   2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted's comments
> on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com> wrote:
>
>>  I did the cumulative sum in the HIVE UDF, as one of the project for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative sum. For
>> example, an account, a department, etc. In the mapper, combine this
>> information as your emit key.
>> 2) If you don't have any grouping requirement, and you just want a
>> cumulative sum for all your data, then send all the data to one common
>> key, so it will all go to the same reducer.
>> 3) When you calculate the cumulative sum, does the output need to have a
>> sorting order? If so, you need to do a secondary sort, so the data will
>> be sorted in the order you want in the reducer.
>> 4) In the reducer, just do the sum, and emit every value per original
>> record (not per key).
>>
>>  I will suggest you do this in a HIVE UDF, as it is much easier, if you
>> can build a HIVE schema on top of your data.
>>
>>  Yong
>>
>>  ------------------------------
>> From: tdunning@maprtech.com
>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>> Subject: Re: Cumulative value using mapreduce
>> To: user@hadoop.apache.org
>>
>>
>> Bertrand is almost right.
>>
>>  The only difference is that the original poster asked about cumulative
>> sum.
>>
>>  This can be done in reducer exactly as Bertrand described except for
>> two points that make it different from word count:
>>
>>  a) you can't use a combiner
>>
>>  b) the output of the program is as large as the input so it will have
>> different performance characteristics than aggregation programs like
>> wordcount.
>>
>>  Bertrand's key recommendation to go read a book is the most important
>> advice.
>>
>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:
>>
>> Hi,
>>
>>  It sounds like a
>> 1) group information by account
>> 2) compute sum per account
>>
>>  If that's not the case, you should be a bit more precise about your
>> context.
>>
>>  This computation looks like a small variant of wordcount. If you do not
>> know how to do it, you should read books about Hadoop MapReduce and/or
>> online tutorial. Yahoo's is old but still a nice read to begin with :
>> http://developer.yahoo.com/hadoop/tutorial/
>>
>>  Regards,
>>
>>  Bertrand
>>
>>
>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes cumulative credit &
>> debit amounts at each record
>> and append these values to the record before dumping into the output file.
>>
>> Is this possible? How can I achieve this? Where should I put the logic of
>> computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>>
>>
>>
>>   --
>> Bertrand Dechoux
>>
>>
>>
>
>
>  --
> Bertrand Dechoux
>
>

RE: Cumulative value using mapreduce

Posted by java8964 java8964 <ja...@hotmail.com>.
Are you allowed to change the order of the data in the output? If you want
to calculate the CR/DR cumulative sum values, it becomes easy if the
business allows you to change the order of your data, grouped by the CR/DR
indicator, in the output.
For example, you can do it very easily, in the way I described in my
original email, if you CAN change the output to the following:
Txn ID   Cr/Dr Indicator   Amount   CR Cumulative Amount   DR Cumulative Amount
1001     CR                1000     1000                   0
1004     CR                2000     3000                   0
1002     DR                500      0                      500
1003     DR                1500     0                      2000
As you can see, you have to group your output by the Cr/Dr indicator. If you
want to keep the original order, then it is hard; at least I cannot think of
a way in a short time.
But if you are allowed to change the order of the output, then this is
called a cumulative sum with grouping (in this case, group 1 for CR, group 2
for DR).
1) In the mapper, emit your data keyed by the Cr/Dr indicator, which will
group the data by CR/DR. So all CR data will go to one reducer, and all DR
data will go to one reducer.
2) Besides grouping the data, if you want the output sorted by the amount
(for example) within each group, then you have to do a secondary sort
(google "secondary sort"). Then, for each group, the data arriving at the
reducer will be sorted by amount. Otherwise, if you don't need that sorting,
just ignore the secondary sort.
3) In each reducer, the arriving data is already grouped. The default
partitioner for an MR job is the hash partitioner. Depending on the
hashCode() returned for 'CR' and 'DR', these 2 groups could go to different
reducers (assuming you are running with multiple reducers), or they could go
to the same reducer. But even if they go to the same reducer, they will
arrive as 2 groups. So the output of your reducers will be grouped, and
sorted along the way.
4) In your reducer, for each group you will get an array of values. For CR,
you will get all the CR records in the array. What you need to do is iterate
over the array, calculate the cumulative sum for every element, and emit the
cumulative sum with each record.
5) In the end, your output could be multiple files, as each file is
generated by one reducer. You can merge them into one file, or just leave
them as they are in HDFS.
6) For best performance, if you have huge data AND you know all the possible
values of the indicator, you may want to consider using your own custom
partitioner instead of the hash partitioner. What you want is a
round-robin-like distribution of the distinct keys across the available
reducers, instead of a random distribution by hash value. Keep in mind that
random distribution does NOT work well when the distinct count of your keys
is small. (A sketch of such a partitioner follows below.)
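
A minimal sketch of point 6, assuming only the two known indicator values
(the class name is illustrative):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spread the two known groups over the reducers explicitly instead of
// trusting hashCode(), which may send both "CR" and "DR" to one reducer.
public class IndicatorPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    return ("CR".equals(key.toString()) ? 0 : 1) % numPartitions;
  }
}

Wired with job.setPartitionerClass(IndicatorPartitioner.class).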
Yong



Re: Cumulative value using mapreduce

Posted by Jane Wayne <ja...@gmail.com>.
there's probably a million ways to do it, but it seems like it can be done,
per your question. off the top of my head, you'd probably want to do
the cumulative sum in the reducer. if you're savvy, maybe even make the
reducer reusable as a combiner (looks like this problem might have an
associative and commutative reducer).

the difficulty with this problem is that for n input records, you will have
n output records (looking at your example). furthermore, each n-th output
record requires information from all the previous (n-1) records. so, if you
have 1 billion input records, it's looking like you may have to move a lot
of intermediary key-value pairs to your reducer.

here's a suggestion and please critique, perhaps i may learn something.
let's take a naive approach. i assume you have this data in a CSV text
file. i assume the Tx Ids are sequential, and you know what the
start/stop Tx Id is. the mapper/reducer "pseudocode" looks like the
following.

map(byteOffset, text) {
 data = parse(text)
 for i=data.txId to stopTxId
  emit(i, data)
}

reduce(txId, datas) {
 cr = 0
 dr = 0

 while datas.hasMoreItems
  data = datas.nextItem // iterate
  if "dr" == data.crDrIndicator
   dr += data.amount
  else
   cr += data.amount

 emit(txId, {cr, dr})
}

what's not desirable about this pseudocode?
1. lots of intermediary key-value pairs
2. no combiner
3. requires knowledge of background information and certain assumptions
4. will definitely create "stragglers" (some mappers/reducers will take
longer to complete than others)
5. overflow issues with the cumulative sum?

i thought about the secondary sorting idea, but i'm still not sure how that
can work. what would you sort on?

one of the things i learned in programming 101: get the algorithm to work
first, then optimize later. hope this helps. please feel free to critique.
would love to learn some more.
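
for concreteness, a direct java translation of the naive mapper above (the
field layout and the config knob are my own illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NaiveCumulativeMapper
    extends Mapper<LongWritable, Text, LongWritable, Text> {
  private long stopTxId;

  @Override
  protected void setup(Context ctx) {
    // hypothetical job parameter holding the last tx id
    stopTxId = ctx.getConfiguration().getLong("cumsum.stop.txid", 0L);
  }

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    long txId = Long.parseLong(line.toString().split(",")[0]);
    // replicate the record to every later tx id: this is exactly the
    // O(n^2) intermediate blow-up called out in the critique above
    for (long i = txId; i <= stopTxId; i++) {
      ctx.write(new LongWritable(i), line);
    }
  }
}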

On Fri, Oct 5, 2012 at 12:56 AM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

>  Thanks for all your responses. As suggested, will go through the
> documentation once again.
>
> But just to clarify, this is not my first map-reduce program. I've already
> written a map-reduce for our product which does filtering and
> transformation of the financial data. This is a new requirement we've got.
> I have also implemented the logic of calculating the cumulative sums. But
> the output is not coming as desired and I feel I'm not doing it the right
> way and missing something. So thought of taking quick help from the
> mailing list.
>
> As an example, say we have records as below -
>
>   Txn ID   Txn Date    Cr/Dr Indicator   Amount
>   1001     9/22/2012   CR                1000
>   1002     9/25/2012   DR                500
>   1003     10/1/2012   DR                1500
>   1004     10/4/2012   CR                2000
>
> When this file is passed, the logic should append the below 2 columns to
> the output for each record above -
>
>   CR Cumulative Amount   DR Cumulative Amount
>   1000                   0
>   1000                   500
>   1000                   2000
>   3000                   2000
>
> Hope the problem is clear now. Please provide your suggestions on the
> approach to the solution.
>
> Regards,
> Sarath.
>
>
> On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
>
> I indeed didn't catch the cumulative sum part. Then I guess it begs for
> what-is-often-called-a-secondary-sort, if you want to compute different
> cumulative sums during the same job. It can be more or less easy to
> implement depending on which API/library/tool you are using. Ted's comments
> on performance are spot on.
>
>  Regards
>
>  Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com> wrote:
>
>>  I did the cumulative sum in the HIVE UDF, as one of the project for my
>> employer.
>>
>>  1) You need to decide the grouping elements for your cumulative sum. For
>> example, an account, a department, etc. In the mapper, combine this
>> information as your emit key.
>> 2) If you don't have any grouping requirement, and you just want a
>> cumulative sum for all your data, then send all the data to one common
>> key, so it will all go to the same reducer.
>> 3) When you calculate the cumulative sum, does the output need to have a
>> sorting order? If so, you need to do a secondary sort, so the data will
>> be sorted in the order you want in the reducer.
>> 4) In the reducer, just do the sum, and emit every value per original
>> record (not per key).
>>
>>  I will suggest you do this in a HIVE UDF, as it is much easier, if you
>> can build a HIVE schema on top of your data.
>>
>>  Yong
>>
>>  ------------------------------
>> From: tdunning@maprtech.com
>> Date: Thu, 4 Oct 2012 18:52:09 +0100
>> Subject: Re: Cumulative value using mapreduce
>> To: user@hadoop.apache.org
>>
>>
>> Bertrand is almost right.
>>
>>  The only difference is that the original poster asked about cumulative
>> sum.
>>
>>  This can be done in reducer exactly as Bertrand described except for
>> two points that make it different from word count:
>>
>>  a) you can't use a combiner
>>
>>  b) the output of the program is as large as the input so it will have
>> different performance characteristics than aggregation programs like
>> wordcount.
>>
>>  Bertrand's key recommendation to go read a book is the most important
>> advice.
>>
>> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:
>>
>> Hi,
>>
>>  It sounds like a
>> 1) group information by account
>> 2) compute sum per account
>>
>>  If that's not the case, you should be a bit more precise about your
>> context.
>>
>>  This computation looks like a small variant of wordcount. If you do not
>> know how to do it, you should read books about Hadoop MapReduce and/or
>> online tutorial. Yahoo's is old but still a nice read to begin with :
>> http://developer.yahoo.com/hadoop/tutorial/
>>
>>  Regards,
>>
>>  Bertrand
>>
>>
>> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
>> sarathchandra.josyam@algofusiontech.com> wrote:
>>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes cumulative credit &
>> debit amounts at each record
>> and append these values to the record before dumping into the output file.
>>
>> Is this possible? How can I achieve this? Where should I put the logic of
>> computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>>
>>
>>
>>   --
>> Bertrand Dechoux
>>
>>
>>
>
>
>  --
> Bertrand Dechoux
>
>

Re: Cumulative value using mapreduce

Posted by Sarath <sa...@algofusiontech.com>.
Thanks for all your responses. As suggested, will go through the
documentation once again.

But just to clarify, this is not my first map-reduce program. I've
already written a map-reduce for our product which does filtering and
transformation of the financial data. This is a new requirement we've
got. I have also implemented the logic of calculating the cumulative
sums. But the output is not coming as desired and I feel I'm not doing
it the right way and missing something. So thought of taking quick help
from the mailing list.

As an example, say we have records as below -

  Txn ID   Txn Date    Cr/Dr Indicator   Amount
  1001     9/22/2012   CR                1000
  1002     9/25/2012   DR                500
  1003     10/1/2012   DR                1500
  1004     10/4/2012   CR                2000


When this file is passed, the logic should append the below 2 columns to
the output for each record above -

  CR Cumulative Amount   DR Cumulative Amount
  1000                   0
  1000                   500
  1000                   2000
  3000                   2000
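
A tiny standalone check that reproduces the two expected columns above
(purely illustrative, single JVM, no Hadoop involved):

public class CumSumCheck {
  public static void main(String[] args) {
    String[][] txns = {
      {"1001", "9/22/2012", "CR", "1000"},
      {"1002", "9/25/2012", "DR", "500"},
      {"1003", "10/1/2012", "DR", "1500"},
      {"1004", "10/4/2012", "CR", "2000"},
    };
    long cr = 0, dr = 0;
    for (String[] t : txns) {
      if ("CR".equals(t[2])) cr += Long.parseLong(t[3]);
      else dr += Long.parseLong(t[3]);
      // prints the record with the two cumulative columns appended
      System.out.println(String.join(",", t) + "," + cr + "," + dr);
    }
  }
}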


Hope the problem is clear now. Please provide your suggestions on the 
approach to the solution.

Regards,
Sarath.

On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
> I indeed didn't catch the cumulative sum part. Then I guess it begs 
> for what-is-often-called-a-secondary-sort, if you want to compute 
> different cumulative sums during the same job. It can be more or less 
> easy to implement depending on which API/library/tool you are using. 
> Ted's comments on performance are spot on.
>
> Regards
>
> Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 
> <java8964@hotmail.com <ma...@hotmail.com>> wrote:
>
>     I did the cumulative sum in the HIVE UDF, as one of the project
>     for my employer.
>
>     1) You need to decide the grouping elements for your cumulative
>     sum. For example, an account, a department, etc. In the mapper,
>     combine this information as your emit key.
>     2) If you don't have any grouping requirement, and you just want a
>     cumulative sum for all your data, then send all the data to one
>     common key, so it will all go to the same reducer.
>     3) When you calculate the cumulative sum, does the output need to
>     have a sorting order? If so, you need to do a secondary sort, so
>     the data will be sorted in the order you want in the reducer.
>     4) In the reducer, just do the sum, and emit every value per
>     original record (not per key).
>
>     I will suggest you do this in a HIVE UDF, as it is much easier,
>     if you can build a HIVE schema on top of your data.
>
>     Yong
>
>     ------------------------------------------------------------------------
>     From: tdunning@maprtech.com <ma...@maprtech.com>
>     Date: Thu, 4 Oct 2012 18:52:09 +0100
>     Subject: Re: Cumulative value using mapreduce
>     To: user@hadoop.apache.org <ma...@hadoop.apache.org>
>
>
>     Bertrand is almost right.
>
>     The only difference is that the original poster asked about
>     cumulative sum.
>
>     This can be done in reducer exactly as Bertrand described except
>     for two points that make it different from word count:
>
>     a) you can't use a combiner
>
>     b) the output of the program is as large as the input so it will
>     have different performance characteristics than aggregation
>     programs like wordcount.
>
>     Bertrand's key recommendation to go read a book is the most
>     important advice.
>
>     On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux
>     <dechouxb@gmail.com <ma...@gmail.com>> wrote:
>
>         Hi,
>
>         It sounds like a
>         1) group information by account
>         2) compute sum per account
>
>         If that's not the case, you should be a bit more precise about
>         your context.
>
>         This computation looks like a small variant of wordcount. If you
>         do not know how to do it, you should read books about Hadoop
>         MapReduce and/or online tutorial. Yahoo's is old but still a
>         nice read to begin with :
>         http://developer.yahoo.com/hadoop/tutorial/
>
>         Regards,
>
>         Bertrand
>
>
>         On Thu, Oct 4, 2012 at 3:58 PM, Sarath
>         <sarathchandra.josyam@algofusiontech.com
>         <ma...@algofusiontech.com>> wrote:
>
>             Hi,
>
>             I have a file which has some financial transaction data.
>             Each transaction will have amount and a credit/debit
>             indicator.
>             I want to write a mapreduce program which computes
>             cumulative credit & debit amounts at each record
>             and append these values to the record before dumping into
>             the output file.
>
>             Is this possible? How can I achieve this? Where should I
>             put the logic of computing the cumulative values?
>
>             Regards,
>             Sarath.
>
>
>
>
>         -- 
>         Bertrand Dechoux
>
>
>
>
>
> -- 
> Bertrand Dechoux

Re: Cumulative value using mapreduce

Posted by Sarath <sa...@algofusiontech.com>.
Thanks for all your responses. As suggested will go through the 
documentation once again.

But just to clarify, this is not my first map-reduce program. I've 
already written a map-reduce for our product which does filtering and 
transformation of the financial data. This is a new requirement we've 
got. I have also did the logic of calculating the cumulative sums. But 
the output is not coming as desired and I feel I'm not doing it right 
way and missing something. So thought of taking a quick help from the 
mailing list.

As an example, say we have records as below -
Txn ID
	Txn Date
	Cr/Dr Indicator
	Amount
1001
	9/22/2012
	CR
	1000
1002
	9/25/2012
	DR
	500
1003
	10/1/2012
	DR
	1500
1004
	10/4/2012
	CR
	2000


When this file passed the logic should append the below 2 columns to the 
output for each record above -
CR Cumulative Amount
	DR Cumulative Amount
1000
	0
1000
	500
1000
	2000
3000
	2000


Hope the problem is clear now. Please provide your suggestions on the 
approach to the solution.
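
To make the requirement concrete, here is a minimal sketch of one way to
produce exactly these two extra columns with a single-reducer job. It
assumes tab-separated input (txnId, date, indicator, amount), that sorting
by the numeric txn id gives the desired record order, and the
org.apache.hadoop.mapreduce API; the class names are illustrative, not the
actual job.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CumulativeAmounts {

    // Key each record by its numeric txn id; with a single reducer the
    // framework then hands reduce() the records in txn-id order.
    public static class TxnMapper
            extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t");
            context.write(new LongWritable(Long.parseLong(fields[0])), line);
        }
    }

    public static class RunningTotalReducer
            extends Reducer<LongWritable, Text, Text, NullWritable> {
        // Running totals are instance state, carried across reduce() calls;
        // this is only correct because the job uses exactly one reducer.
        private long crTotal = 0;
        private long drTotal = 0;

        @Override
        protected void reduce(LongWritable txnId, Iterable<Text> records,
                              Context context)
                throws IOException, InterruptedException {
            for (Text record : records) {
                String[] fields = record.toString().split("\t");
                long amount = Long.parseLong(fields[3]);
                if ("CR".equals(fields[2])) {
                    crTotal += amount;
                } else {
                    drTotal += amount;
                }
                // Emit the original record with the two cumulative
                // columns appended, as in the example above.
                context.write(new Text(record.toString() + "\t" + crTotal
                        + "\t" + drTotal), NullWritable.get());
            }
        }
    }
}

The driver must set a single reducer and no combiner, for the reasons Ted
gives further down in the thread.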

Regards,
Sarath.

On Friday 05 October 2012 02:51 AM, Bertrand Dechoux wrote:
> I indeed didn't catch the cumulative sum part. Then I guess it begs 
> for what-is-often-called-a-secondary-sort, if you want to compute 
> different cumulative sums during the same job. It can be more or less 
> easy to implement depending on which API/library/tool you are using.
> Ted's comments on performance are spot on.
>
> Regards
>
> Bertrand
>
> On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 
> <java8964@hotmail.com> wrote:
>
>     I did the cumulative sum in a Hive UDF, as one of the projects
>     for my employer.
>
>     1) You need to decide the grouping elements for your cumulative sum.
>     For example, an account, a department, etc. In the mapper, combine
>     this information into the key you emit.
>     2) If you don't have any grouping requirement and just want a
>     cumulative sum for all your data, then send all the data to one
>     common key, so they will all go to the same reducer.
>     3) When you calculate the cumulative sum, does the output need to
>     have a sorting order? If so, you need to do a secondary sort, so
>     the data will be sorted in the order you want in the reducer.
>     4) In the reducer, just do the sum, and emit a value for every
>     original record (not per key).
>
>     I will suggest you do this in a Hive UDF, as it is much easier,
>     if you can build a Hive schema on top of your data.
>
>     Yong
>
>     ------------------------------------------------------------------------
>     From: tdunning@maprtech.com
>     Date: Thu, 4 Oct 2012 18:52:09 +0100
>     Subject: Re: Cumulative value using mapreduce
>     To: user@hadoop.apache.org
>
>
>     Bertrand is almost right.
>
>     The only difference is that the original poster asked about
>     cumulative sum.
>
>     This can be done in the reducer exactly as Bertrand described except
>     for two points that make it different from word count:
>
>     a) you can't use a combiner
>
>     b) the output of the program is as large as the input so it will
>     have different performance characteristics than aggregation
>     programs like wordcount.
>
>     Bertrand's key recommendation to go read a book is the most
>     important advice.
>
>     On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux
>     <dechouxb@gmail.com> wrote:
>
>         Hi,
>
>         It sounds like a
>         1) group information by account
>         2) compute sum per account
>
>         If that's not the case, please give a bit more detail about your
>         context.
>
>         This computing looks like a small variant of wordcount. If you
>         do not know how to do it, you should read books about Hadoop
>         MapReduce and/or online tutorial. Yahoo's is old but still a
>         nice read to begin with:
>         http://developer.yahoo.com/hadoop/tutorial/
>
>         Regards,
>
>         Bertrand
>
>
>         On Thu, Oct 4, 2012 at 3:58 PM, Sarath
>         <sarathchandra.josyam@algofusiontech.com> wrote:
>
>             Hi,
>
>             I have a file which has some financial transaction data.
>             Each transaction will have amount and a credit/debit
>             indicator.
>             I want to write a mapreduce program which computes
>             cumulative credit & debit amounts at each record
>             and append these values to the record before dumping into
>             the output file.
>
>             Is this possible? How can I achieve this? Where should I
>             put the logic of computing the cumulative values?
>
>             Regards,
>             Sarath.
>
>
>
>
>         -- 
>         Bertrand Dechoux
>
>
>
>
>
> -- 
> Bertrand Dechoux

Re: Cumulative value using mapreduce

Posted by Bertrand Dechoux <de...@gmail.com>.
I indeed didn't catch the cumulative sum part. Then I guess it begs for
what-is-often-called-a-secondary-sort, if you want to compute different
cumulative sums during the same job. It can be more or less easy to
implement depending on which API/library/tool you are using. Ted's
comments on performance are spot on.
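
A sketch of that secondary-sort wiring in the org.apache.hadoop.mapreduce
API, assuming the account is the grouping element and the transaction date
the sort field (all names are illustrative, and each class would normally
live in its own file):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: group by account, order within a group by date.
class AccountDateKey implements WritableComparable<AccountDateKey> {
    private Text account = new Text();
    private LongWritable date = new LongWritable(); // e.g. days since epoch

    public Text getAccount() { return account; }

    public void write(DataOutput out) throws IOException {
        account.write(out);
        date.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        account.readFields(in);
        date.readFields(in);
    }

    // Full sort order: account first, then date.
    public int compareTo(AccountDateKey other) {
        int cmp = account.compareTo(other.account);
        return cmp != 0 ? cmp : date.compareTo(other.date);
    }
}

// Partition on the account only, so every record of an account
// reaches the same reducer.
class AccountPartitioner extends Partitioner<AccountDateKey, Text> {
    @Override
    public int getPartition(AccountDateKey key, Text value, int numPartitions) {
        return (key.getAccount().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the account only, so one reduce() call walks the whole
// account's records, already sorted by date.
class AccountGroupingComparator extends WritableComparator {
    protected AccountGroupingComparator() {
        super(AccountDateKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((AccountDateKey) a).getAccount()
                .compareTo(((AccountDateKey) b).getAccount());
    }
}

The driver then wires these in with
job.setPartitionerClass(AccountPartitioner.class) and
job.setGroupingComparatorClass(AccountGroupingComparator.class).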

Regards

Bertrand

On Thu, Oct 4, 2012 at 9:02 PM, java8964 java8964 <ja...@hotmail.com> wrote:

>  I did the cumulative sum in a Hive UDF, as one of the projects for my
> employer.
>
> 1) You need to decide the grouping elements for your cumulative sum. For
> example, an account, a department, etc. In the mapper, combine this
> information into the key you emit.
> 2) If you don't have any grouping requirement and just want a cumulative
> sum for all your data, then send all the data to one common key, so they
> will all go to the same reducer.
> 3) When you calculate the cumulative sum, does the output need to have a
> sorting order? If so, you need to do a secondary sort, so the data will
> be sorted in the order you want in the reducer.
> 4) In the reducer, just do the sum, and emit a value for every original
> record (not per key).
>
> I will suggest you do this in a Hive UDF, as it is much easier, if you
> can build a Hive schema on top of your data.
>
> Yong
>
> ------------------------------
> From: tdunning@maprtech.com
> Date: Thu, 4 Oct 2012 18:52:09 +0100
> Subject: Re: Cumulative value using mapreduce
> To: user@hadoop.apache.org
>
>
> Bertrand is almost right.
>
> The only difference is that the original poster asked about cumulative sum.
>
> This can be done in the reducer exactly as Bertrand described except for two
> points that make it different from word count:
>
> a) you can't use a combiner
>
> b) the output of the program is as large as the input so it will have
> different performance characteristics than aggregation programs like
> wordcount.
>
> Bertrand's key recommendation to go read a book is the most important
> advice.
>
> On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:
>
> Hi,
>
> It sounds like a
> 1) group information by account
> 2) compute sum per account
>
> If that's not the case, please give a bit more detail about your context.
>
> This computing looks like a small variant of wordcount. If you do not know
> how to do it, you should read books about Hadoop MapReduce and/or online
> tutorial. Yahoo's is old but still a nice read to begin with:
> http://developer.yahoo.com/hadoop/tutorial/
>
> Regards,
>
> Bertrand
>
>
> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
> sarathchandra.josyam@algofusiontech.com> wrote:
>
> Hi,
>
> I have a file which has some financial transaction data. Each transaction
> will have amount and a credit/debit indicator.
> I want to write a mapreduce program which computes cumulative credit &
> debit amounts at each record
> and append these values to the record before dumping into the output file.
>
> Is this possible? How can I achieve this? Where should I put the logic of
> computing the cumulative values?
>
> Regards,
> Sarath.
>
>
>
>
> --
> Bertrand Dechoux
>
>
>


-- 
Bertrand Dechoux

RE: Cumulative value using mapreduce

Posted by java8964 java8964 <ja...@hotmail.com>.
I did the cumulative sum in a Hive UDF, as one of the projects for my employer.

1) You need to decide the grouping elements for your cumulative sum. For
example, an account, a department, etc. In the mapper, combine this
information into the key you emit.
2) If you don't have any grouping requirement and just want a cumulative
sum for all your data, then send all the data to one common key, so they
will all go to the same reducer.
3) When you calculate the cumulative sum, does the output need to have a
sorting order? If so, you need to do a secondary sort, so the data will be
sorted in the order you want in the reducer.
4) In the reducer, just do the sum, and emit a value for every original
record (not per key).

I will suggest you do this in a Hive UDF, as it is much easier, if you can
build a Hive schema on top of your data.
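
A minimal sketch of such a stateful UDF, assuming the old reflection-based
org.apache.hadoop.hive.ql.exec.UDF API that was current at the time; the
class name is illustrative, and the running total is only meaningful when
all rows flow through one UDF instance in the desired order (e.g. via a
single reducer):

package com.example.hive.udf; // hypothetical package

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.DoubleWritable;

public class CumulativeSum extends UDF {
    // Carries the running total across the rows seen by this instance.
    private double runningTotal = 0.0;

    public DoubleWritable evaluate(DoubleWritable amount) {
        if (amount != null) {
            runningTotal += amount.get();
        }
        return new DoubleWritable(runningTotal);
    }
}

Registered via ADD JAR and CREATE TEMPORARY FUNCTION, it would then be
applied as cumulative_sum(amount) in a query that distributes and sorts
the rows appropriately.
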
Yong

From: tdunning@maprtech.com
Date: Thu, 4 Oct 2012 18:52:09 +0100
Subject: Re: Cumulative value using mapreduce
To: user@hadoop.apache.org

Bertrand is almost right.
The only difference is that the original poster asked about cumulative sum.
This can be done in the reducer exactly as Bertrand described except for two points that make it different from word count:


a) you can't use a combiner
b) the output of the program is as large as the input so it will have different performance characteristics than aggregation programs like wordcount.


Bertrand's key recommendation to go read a book is the most important advice.

On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:


Hi,
It sounds like a
1) group information by account
2) compute sum per account


If that's not the case, please give a bit more detail about your context.

This computing looks like a small variant of wordcount. If you do not know how to do it, you should read books about Hadoop MapReduce and/or online tutorials. Yahoo's is old but still a nice read to begin with: http://developer.yahoo.com/hadoop/tutorial/



Regards,
Bertrand

On Thu, Oct 4, 2012 at 3:58 PM, Sarath <sa...@algofusiontech.com> wrote:



Hi,



I have a file which has some financial transaction data. Each transaction will have amount and a credit/debit indicator.

I want to write a mapreduce program which computes cumulative credit & debit amounts at each record

and append these values to the record before dumping into the output file.



Is this possible? How can I achieve this? Where should I put the logic of computing the cumulative values?



Regards,

Sarath.



-- 
Bertrand Dechoux

Re: Cumulative value using mapreduce

Posted by Ted Dunning <td...@maprtech.com>.
Bertrand is almost right.

The only difference is that the original poster asked about cumulative sum.

This can be done in the reducer exactly as Bertrand described except for two
points that make it different from word count:

a) you can't use a combiner

b) the output of the program is as large as the input so it will have
different performance characteristics than aggregation programs like
wordcount.

Bertrand's key recommendation to go read a book is the most important
advice.
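
In driver terms, points a) and b) come down to something like the
following sketch, pairing with hypothetical mapper/reducer classes like
the ones sketched earlier in the thread (Job.getInstance assumes a
Hadoop 2-era API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CumulativeSumDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cumulative sum");
        job.setJarByClass(CumulativeSumDriver.class);
        job.setMapperClass(CumulativeAmounts.TxnMapper.class);
        job.setReducerClass(CumulativeAmounts.RunningTotalReducer.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // a) No setCombinerClass(...) call: a combiner would pre-aggregate
        //    map output and destroy the per-record running totals.
        // b) A single reducer gives one globally ordered stream; since the
        //    output is as large as the input, this reducer is the bottleneck.
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}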

On Thu, Oct 4, 2012 at 5:20 PM, Bertrand Dechoux <de...@gmail.com> wrote:

> Hi,
>
> It sounds like a
> 1) group information by account
> 2) compute sum per account
>
> If that's not the case, please give a bit more detail about your context.
>
> This computing looks like a small variant of wordcount. If you do not know
> how to do it, you should read books about Hadoop MapReduce and/or online
> tutorials. Yahoo's is old but still a nice read to begin with:
> http://developer.yahoo.com/hadoop/tutorial/
>
> Regards,
>
> Bertrand
>
>
> On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
> sarathchandra.josyam@algofusiontech.com> wrote:
>
>> Hi,
>>
>> I have a file which has some financial transaction data. Each transaction
>> will have amount and a credit/debit indicator.
>> I want to write a mapreduce program which computes cumulative credit &
>> debit amounts at each record
>> and append these values to the record before dumping into the output file.
>>
>> Is this possible? How can I achieve this? Where should I put the logic of
>> computing the cumulative values?
>>
>> Regards,
>> Sarath.
>>
>
>
>
> --
> Bertrand Dechoux
>

Re: Cumulative value using mapreduce

Posted by Bertrand Dechoux <de...@gmail.com>.
Hi,

It sounds like a
1) group information by account
2) compute sum per account

If that's not the case, please give a bit more detail about your context.

This computing looks like a small variant of wordcount. If you do not know
how to do it, you should read books about Hadoop MapReduce and/or online
tutorials. Yahoo's is old but still a nice read to begin with:
http://developer.yahoo.com/hadoop/tutorial/
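
For this group-by-account reading of the question, the reducer really is
wordcount with amounts. A sketch, assuming the mapper emits
(account, amount) pairs; the class name is illustrative:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Wordcount-style aggregation: one total per account key.
public class AccountSumReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {
    @Override
    protected void reduce(Text account, Iterable<LongWritable> amounts,
                          Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable amount : amounts) {
            sum += amount.get();
        }
        context.write(account, new LongWritable(sum));
    }
}

Unlike the running-total job discussed elsewhere in the thread, this pure
aggregation can also be registered as its own combiner, since the sum is
associative and the key/value types match.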

Regards,

Bertrand

On Thu, Oct 4, 2012 at 3:58 PM, Sarath <
sarathchandra.josyam@algofusiontech.com> wrote:

> Hi,
>
> I have a file which has some financial transaction data. Each transaction
> will have amount and a credit/debit indicator.
> I want to write a mapreduce program which computes cumulative credit &
> debit amounts at each record
> and append these values to the record before dumping into the output file.
>
> Is this possible? How can I achieve this? Where should I put the logic of
> computing the cumulative values?
>
> Regards,
> Sarath.
>



-- 
Bertrand Dechoux
