Posted to mapreduce-user@hadoop.apache.org by Li Li <fa...@gmail.com> on 2014/05/14 03:45:43 UTC
last mapper of mapreduce on hbase very slow

I use two HBase tables as mapper input.
One is the url table, the other holds the links between urls.
sample rows of the url table: http://abc.com/index.htm, content1
http://abc.com/news/123.htm, content2
sample row of the links table:
http://abc.com/index.htm++http://abc.com/news/123.htm anchor1
The mapper aggregates each url's info with the info of its out edges.
In the above example the output is grouped as:
key http://abc.com/index.htm value iterator: <content1; anchor1>
key http://abc.com/news/123.htm value iterator: <content2>
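
The grouping can be illustrated with a minimal Java sketch. It assumes a links-table row key has the form <source url>++<target url>, as in the sample row, and that a url-table row key never contains "++". The class and method names are made up for illustration; the actual Step1Mapper is not shown in this post.

```java
final class RowKeys {
    // Separator between source and target URL in a links-table row key,
    // taken from the sample row in the post.
    static final String SEP = "++";

    // Returns the URL a row would be grouped under: the row key itself for
    // a url-table row, or the part before "++" for a links-table row.
    static String groupKey(String rowKey) {
        int i = rowKey.indexOf(SEP);
        return i < 0 ? rowKey : rowKey.substring(0, i);
    }

    public static void main(String[] args) {
        System.out.println(groupKey("http://abc.com/index.htm"));
        System.out.println(groupKey("http://abc.com/index.htm++http://abc.com/news/123.htm"));
    }
}
```

With a key like this, the url row and every out-link row of the same source url end up under one key.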

my code:
// One Scan per input table; the table name travels as a scan attribute.
List<Scan> scans = new ArrayList<Scan>();

Scan urldbScan = new Scan();
urldbScan.setCaching(5000);       // rows requested per scanner call
urldbScan.setCacheBlocks(false);  // keep the full scan out of the block cache
urldbScan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, HbaseTools.TB_URL_DB_BT);
urldbScan.addFamily(HbaseTools.CF_BT);
scans.add(urldbScan);

Scan outLinkScan = new Scan();
outLinkScan.setCaching(5000);
outLinkScan.setCacheBlocks(false);
outLinkScan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, HbaseTools.TB_OUT_LINK_BT);
outLinkScan.addFamily(HbaseTools.CF_BT);
scans.add(outLinkScan);

// Map output key/value types are BytesWritable/ScheduleData.
TableMapReduceUtil.initTableMapperJob(scans, Step1Mapper.class,
    BytesWritable.class, ScheduleData.class, job);

The last mapper is very slow (the other mappers finish in about 20
minutes, while the last one takes more than 40 minutes). Hadoop starts
a speculative attempt for it, but the original attempt still finishes first.
The slow mapper logs something like:
2014-05-14 09:04:18,020 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65401; kvend = 327545; length = 327680
2014-05-14 09:04:18,621 INFO org.apache.hadoop.mapred.MapTask: Finished spill 169
2014-05-14 09:04:56,698 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2014-05-14 09:04:56,699 INFO org.apache.hadoop.mapred.MapTask: bufstart = 73854975; bufend = 86572670; bufvoid = 99614720
2014-05-14 09:04:56,699 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327545; kvend = 262008; length = 327680
2014-05-14 09:04:57,329 INFO org.apache.hadoop.mapred.MapTask: Finished spill 170
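
The numbers in the spill 170 log lines can be checked with a short calculation. This is only a sketch, assuming the Hadoop 1.x reading of these fields: kvstart/kvend index a circular record buffer with `length` slots, and bufstart/bufend index a circular byte buffer of size bufvoid. The class and method names are made up for illustration.

```java
final class SpillMath {
    // Number of entries from start up to end in a circular buffer of the given size.
    static long span(long start, long end, long size) {
        return Math.floorMod(end - start, size);
    }

    public static void main(String[] args) {
        // Values copied from the "spill 170" log lines.
        long recordSlots = 327680L;
        long byteCapacity = 99614720L;
        long records = span(327545L, 262008L, recordSlots);
        long bytes = span(73854975L, 86572670L, byteCapacity);

        System.out.printf("records in spill: %d (%.1f%% of record slots)%n",
                records, 100.0 * records / recordSlots);
        System.out.printf("bytes in spill:   %d (%.1f%% of byte buffer)%n",
                bytes, 100.0 * bytes / byteCapacity);
        System.out.printf("average bytes per record: %.1f%n", (double) bytes / records);
    }
}
```

Under that reading, the spill holds 262,143 records in 12,717,695 bytes: roughly 80% of the record slots are used but only about 13% of the byte buffer, which matches the "record full = true" message. If the job runs on Hadoop 1.x, the share of the sort buffer reserved for record accounting is set by io.sort.record.percent.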

The url table has about 60,000,000 rows and the links table has about
210,000,000 rows. Both tables have 20 regions.
Obviously the latter is larger. If I add outLinkScan before urldbScan,
will it be faster?
I found that at the beginning the job makes about 3,000 HBase requests
per second, but when only the last mapper is running, there are fewer
than 200 HBase requests per second.
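
For a sense of scale, here is a rough Java estimate of the rows each map task would read. It assumes one map task per region and rows spread evenly over the regions; neither is confirmed by the post, and a slow last mapper may well mean the rows are not evenly spread. The names are made up for illustration.

```java
final class SplitEstimate {
    // Average rows per region if rows were spread evenly.
    static long rowsPerRegion(long rows, int regions) {
        return rows / regions;
    }

    public static void main(String[] args) {
        long urlRows = rowsPerRegion(60_000_000L, 20);    // url table
        long linkRows = rowsPerRegion(210_000_000L, 20);  // links table
        System.out.println("url rows per region:  " + urlRows);
        System.out.println("link rows per region: " + linkRows);
        System.out.println("link/url ratio:       " + (double) linkRows / urlRows);
    }
}
```

Even in the evenly spread case, a links-table task would read about 3.5 times as many rows as a url-table task.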

some mapreduce counters:
                                      map    reduce     total
Map-Reduce Framework
       Map output materialized bytes 17,507,954,820 0 17,507,954,820
       Map input records 272,044,972 0 272,044,972
       Reduce shuffle bytes 0 16,417,268,500 16,417,268,500
       Spilled Records 753,880,977 272,044,972 1,025,925,949
       Map output bytes 16,954,700,625 0 16,954,700,625
       CPU time spent (ms) 5,992,170 1,410,260 7,402,430
       Total committed heap usage (bytes) 9,461,104,640 7,806,779,392 17,267,884,032
       Combine input records 0 0 0
       SPLIT_RAW_BYTES 5,876 0 5,876
       Reduce input records 0 272,044,972 272,044,972
       Reduce input groups 0 40,958,494 40,958,494
       Combine output records 0 0 0
       Physical memory (bytes) snapshot 12,862,795,776 8,684,433,408 21,547,229,184
       Reduce output records 0 2,693,146 2,693,146
       Virtual memory (bytes) snapshot 119,726,366,720 26,299,596,800 146,025,963,520
       Map output records 272,044,972 0 272,044,972
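
Two ratios from the framework counters may help when reading them. The sketch below only divides numbers copied from the map column; the class and method names are made up for illustration.

```java
final class FrameworkRatios {
    static double ratio(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        long mapOutputRecords = 272_044_972L;
        long mapSpilledRecords = 753_880_977L;
        long mapOutputBytes = 16_954_700_625L;

        // How many times, on average, a map output record was written to local disk.
        System.out.printf("spilled / output records: %.2f%n",
                ratio(mapSpilledRecords, mapOutputRecords));
        // Average serialized size of one map output record.
        System.out.printf("output bytes per record:  %.1f%n",
                ratio(mapOutputBytes, mapOutputRecords));
    }
}
```

The first ratio comes out at about 2.77, so on the map side each output record was written to disk between two and three times on average. A value above 1 usually means the spill files needed more than one merge pass.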

HBase Counters
       REMOTE_RPC_CALLS 369,863 0 369,863
       RPC_CALLS 2,720,550 0 2,720,550
       RPC_RETRIES 0 0 0
       NOT_SERVING_REGION_EXCEPTION 0 0 0
       MILLIS_BETWEEN_NEXTS 22,157,003 0 22,157,003
       NUM_SCANNER_RESTARTS 0 0 0
       BYTES_IN_RESULTS 61,232,515,133 0 61,232,515,133
       BYTES_IN_REMOTE_RESULTS 7,287,035,830 0 7,287,035,830
       REMOTE_RPC_RETRIES 0 0 0
       REGIONS_SCANNED 40 0 40
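
The HBase counters can be related to the scan settings with a similar sketch. It assumes that Map input records equals the number of rows the scanners returned and that RPC_CALLS consists mostly of scanner calls; both are assumptions, not something the counters state. The names are made up for illustration.

```java
final class ScanRatios {
    static double ratio(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        long mapInputRecords = 272_044_972L;
        long rpcCalls = 2_720_550L;
        long remoteRpcCalls = 369_863L;
        long bytesInResults = 61_232_515_133L;

        System.out.printf("rows per RPC:         %.1f%n", ratio(mapInputRecords, rpcCalls));
        System.out.printf("bytes per RPC:        %.0f%n", ratio(bytesInResults, rpcCalls));
        System.out.printf("remote share of RPCs: %.1f%%%n", 100.0 * ratio(remoteRpcCalls, rpcCalls));
    }
}
```

Under those assumptions each RPC returns about 100 rows, far fewer than the 5000 passed to setCaching, and about 14% of the calls go to a remote region server. The counters alone do not say why the rows per call are so low.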