Posted to user@hbase.apache.org by Justin Cohen <ju...@teamaol.com> on 2010/09/13 23:43:16 UTC
Tuning simple count m/r job
I have a table with 82 regions and about 44 million rows. It takes
almost 6 minutes to count with MapReduce. Is that a reasonable rate for
a ten-machine cluster of data nodes? That's just over 12,000 rows per
second per machine... Can I do better? Right now the only custom thing I
am doing is setting scan.setCaching to 10,000. There's one gz column per
row, but I just want to count rows, not decompress the columns...
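[Editorial sketch, not part of the original post: one common way to count rows without shipping the compressed column values to the map tasks is to put a FirstKeyOnlyFilter on the scan, so each row returns only its first KeyValue. The sketch below assumes the standard HBase client/MapReduce API of that era; "mytable" and the class names are placeholders, and it requires a running cluster to execute.]

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class FastRowCount {

  // Mapper that only bumps a counter; it never inspects cell values,
  // so the gz column contents are never touched on the client side.
  static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
    enum Counters { ROWS }
    @Override
    protected void map(ImmutableBytesWritable key, Result value, Context ctx) {
      ctx.getCounter(Counters.ROWS).increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "fast row count");
    job.setJarByClass(FastRowCount.class);
    job.setNumReduceTasks(0);                  // map-only; the counter does the counting
    job.setOutputFormatClass(NullOutputFormat.class);  // no output files needed

    Scan scan = new Scan();
    scan.setCaching(10000);                    // as in the original setup
    scan.setFilter(new FirstKeyOnlyFilter());  // return only the first KeyValue per row

    // "mytable" is a placeholder for the real table name.
    TableMapReduceUtil.initTableMapperJob(
        "mytable", scan, CountMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that HBase also ships a ready-made job along these lines, org.apache.hadoop.hbase.mapreduce.RowCounter, which can be run directly against a table.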
Is each map task assigned to a single region? Some map tasks only have a
few thousand rows, while others have over 2 million. Does this mean the
regions aren't balanced, or does region splitting take the size of the
columns into account as well as the number of rows?
Thanks,
Justin