Posted to user@hbase.apache.org by Justin Cohen <ju...@teamaol.com> on 2010/09/13 23:43:16 UTC

Tuning simple count m/r job

  I have a table with 82 regions and about 44 million rows. It takes 
almost 6 minutes to count with map reduce. Is that a reasonable rate for 
a ten machine cluster of data nodes? That's just over 12,000 rows per 
second per machine... Can I do better? Right now the only custom thing I 
am doing is setting scan.setCaching to 10,000. There's one gz column per 
row, but I just want to count rows, not decompress the columns...

Is each map task assigned to exactly one region? Some map tasks only have a 
few thousand rows. Others have over 2 million. Does this mean the regions 
aren't balanced by row count, or does the split also take the size of the 
columns into account along with the number of rows?
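
For reference, the arithmetic behind the figures above can be sketched as follows. This is only a back-of-the-envelope check, assuming the run took a full 360 seconds and that rows were spread evenly over machines and regions; the class and method names are made up for illustration and are not part of HBase.

```java
public class CountJobEstimate {
    // Rows scanned per second by each machine, assuming an even spread of work.
    static double rowsPerSecondPerMachine(long totalRows, double elapsedSeconds, int machines) {
        return totalRows / elapsedSeconds / machines;
    }

    // Rows each region would hold if rows were distributed perfectly evenly.
    static double averageRowsPerRegion(long totalRows, int regions) {
        return (double) totalRows / regions;
    }

    public static void main(String[] args) {
        long totalRows = 44_000_000L;    // "about 44 million rows"
        double elapsedSeconds = 6 * 60;  // "almost 6 minutes", taken as 360 s (assumption)
        int machines = 10;               // ten-machine cluster
        int regions = 82;                // regions in the table

        System.out.printf("rows/s per machine: %.0f%n",
                rowsPerSecondPerMachine(totalRows, elapsedSeconds, machines));
        System.out.printf("average rows per region: %.0f%n",
                averageRowsPerRegion(totalRows, regions));
    }
}
```

With these inputs the rate works out to roughly 12,200 rows per second per machine, and an even spread would put a little over half a million rows in each region, so tasks with a few thousand rows and tasks with over 2 million are both far from the average.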

Thanks,
Justin