You are viewing a plain text version of this content. The canonical link for it is here.
Posted to mapreduce-dev@hadoop.apache.org by James Bond <bo...@gmail.com> on 2015/06/17 06:40:42 UTC

Hadoop MapReduce - Sum and Sort by value question

One of my friends was asked this question on hadoop MapReduce - We have
multiple stores and each stores have many customers visiting and buying
stuff. the dataset consists of "Store#, Customer#, Quantity purchased".
Need a MapReduce code to get the Top 2 customers for each store.

The solution which i thought of was to do a secondary sort on qty (in
descending order - store + qty makes the composite key) and in the reducer
just display first 2 values (or customers) for each Key (store + qty, qty
is part of composite key). This works if the customer is unique, but if the
customer has visited the same store multiple times then how do we do it?
This would require summing up of the quantities for each customer and then
getting the top 2.

The solution which i thought was -

SOLUTION 1

1. Key = Store + Customer

2. Make sure each store goes to the same reducer (Secondary sort/Key
comparator etc).

In Reducer -

3. Sum qty for each customer.

4. Add result in Map

5. Loop thru Map and find the top 2 (or use a Treemap to sort based on sum
of qty).

SOLUTION 2

or the other solution is to write 2 MapRed jobs which runs one after the
other. The first one to get a sum of qty purchased for each customer and
store. The second MapRed to sort by qty and get the top 2 buyers.

The question is -

Is there any other way of achieving this?

In solution 1, If i use a Map and a TreeMap and then looping thru them for
getting results will it be  a problem if we have millions if customers?
Should i be worried about keeping all of them in memory?

If the sorting/looping is really costly, any other way you would advise?

Sample Data:

STORE1,CUST1,100

STORE1,CUST2,10

STORE1,CUST1,40

STORE1,CUST2,27

STORE1,CUST3,63

STORE1,CUST1,33

STORE1,CUST2,12

STORE2,CUST4,24

STORE2,CUST6,234

STORE2,CUST4,3

STORE2,CUST5,34

STORE2,CUST6,234

STORE2,CUST4,234

Output -
STORE1,CUST1,173
STORE1,CUST3,63
STORE2,CUST6,468
STORE2,CUST4,261

Thanks!