Posted to mapreduce-user@hadoop.apache.org by Li Li <fa...@gmail.com> on 2014/05/14 03:45:43 UTC
last mapper of mapreduce on hbase very slow
I use two HBase tables as mapper input: one is a URL table, the other holds the links between URLs.
Sample rows of the URL table:
http://abc.com/index.htm, content1
http://abc.com/news/123.htm, content2
Sample rows of the links table:
http://abc.com/index.htm++http://abc.com/news/123.htm anchor1
The mapper will aggregate the URL info and the out-edge info for each URL. For the example above:
key http://abc.com/index.htm value iterator: <content1; anchor1>
key http://abc.com/news/123.htm value iterator: <content2>
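The grouping described above can be sketched in plain Java, without any HBase or Hadoop classes. This is only an illustration under assumptions: the class name GroupByUrl and its methods are hypothetical, and it assumes the "++" separator never occurs inside a source URL. It shows how a links-table row key maps back to the same key as its URL-table row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the poster's Step1Mapper.
public class GroupByUrl {
    // Separator between source and target URL in a links-table row key.
    static final String SEP = "++";

    // Returns the source URL of a links-table row key.
    // A URL-table row key has no separator and is returned unchanged.
    static String sourceUrl(String rowKey) {
        int i = rowKey.indexOf(SEP);
        return i < 0 ? rowKey : rowKey.substring(0, i);
    }

    // Groups {rowKey, value} pairs by source URL, preserving input order.
    static Map<String, List<String>> group(String[][] rows) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] row : rows) {
            grouped.computeIfAbsent(sourceUrl(row[0]), k -> new ArrayList<>()).add(row[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        String[][] rows = {
            {"http://abc.com/index.htm", "content1"},
            {"http://abc.com/news/123.htm", "content2"},
            {"http://abc.com/index.htm++http://abc.com/news/123.htm", "anchor1"},
        };
        // Prints {http://abc.com/index.htm=[content1, anchor1], http://abc.com/news/123.htm=[content2]}
        System.out.println(group(rows));
    }
}
```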
My code:
List<Scan> scans = new ArrayList<Scan>();

// Scan over the URL table
Scan urldbScan = new Scan();
urldbScan.setCaching(5000);
urldbScan.setCacheBlocks(false);
urldbScan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, HbaseTools.TB_URL_DB_BT);
urldbScan.addFamily(HbaseTools.CF_BT);
scans.add(urldbScan);

// Scan over the out-link table
Scan outLinkScan = new Scan();
outLinkScan.setCaching(5000);
outLinkScan.setCacheBlocks(false);
outLinkScan.setAttribute(Scan.SCAN_ATTRIBUTES_TABLE_NAME, HbaseTools.TB_OUT_LINK_BT);
outLinkScan.addFamily(HbaseTools.CF_BT);
scans.add(outLinkScan);

// One job that reads both tables
TableMapReduceUtil.initTableMapperJob(scans, Step1Mapper.class,
    BytesWritable.class, ScheduleData.class, job);
The last mapper is very slow: the other mappers finish in about 20
minutes, while the last one takes more than 40 minutes. A speculative
task is started for it, but the original attempt still finishes first.
The mapper's log looks like this:
2014-05-14 09:04:18,020 INFO org.apache.hadoop.mapred.MapTask: kvstart = 65401; kvend = 327545; length = 327680
2014-05-14 09:04:18,621 INFO org.apache.hadoop.mapred.MapTask: Finished spill 169
2014-05-14 09:04:56,698 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: record full = true
2014-05-14 09:04:56,699 INFO org.apache.hadoop.mapred.MapTask: bufstart = 73854975; bufend = 86572670; bufvoid = 99614720
2014-05-14 09:04:56,699 INFO org.apache.hadoop.mapred.MapTask: kvstart = 327545; kvend = 262008; length = 327680
2014-05-14 09:04:57,329 INFO org.apache.hadoop.mapred.MapTask: Finished spill 170
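The numbers in these log lines can be checked with a little arithmetic. The sketch below assumes a Hadoop 1.x style map-side sort buffer with io.sort.mb = 100 and io.sort.record.percent = 0.05 (the 1.x defaults, not confirmed for this cluster) and 16 bytes of accounting data per record. The class SpillMath and its method names are hypothetical. Under those assumptions the computed values match "length = 327680" and "bufvoid = 99614720" from the log.

```java
// Hypothetical helper for reading MapTask spill log lines; not part of Hadoop.
public class SpillMath {
    // Assumption: 16 bytes of accounting data per map output record (Hadoop 1.x).
    static final long ACCT_BYTES_PER_RECORD = 16;

    // Bytes of the sort buffer reserved for record accounting.
    static long accountingBytes(long sortMb, long recordPercent) {
        return (sortMb << 20) * recordPercent / 100;
    }

    // Number of records the accounting area can track ("length" in the log).
    static long recordCapacity(long sortMb, long recordPercent) {
        return accountingBytes(sortMb, recordPercent) / ACCT_BYTES_PER_RECORD;
    }

    // Bytes left for serialized key/value data ("bufvoid" in the log).
    static long dataBufferBytes(long sortMb, long recordPercent) {
        return (sortMb << 20) - accountingBytes(sortMb, recordPercent);
    }

    // Distance from start to end in a circular buffer of the given size.
    static long circularSpan(long start, long end, long size) {
        return ((end - start) % size + size) % size;
    }

    public static void main(String[] args) {
        long capacity = recordCapacity(100, 5);      // 327680
        long dataBytes = dataBufferBytes(100, 5);    // 99614720
        // Values taken from the log lines for spill 170.
        long records = circularSpan(327545, 262008, 327680);       // 262143
        long bytes = circularSpan(73854975, 86572670, 99614720);   // 12717695
        System.out.println("record capacity    = " + capacity);
        System.out.println("data buffer bytes  = " + dataBytes);
        System.out.println("records in spill   = " + records);
        System.out.println("data bytes in spill = " + bytes);
        System.out.println("data buffer used   = " + (100.0 * bytes / dataBytes) + " %");
        System.out.println("bytes per record   = " + ((double) bytes / records));
    }
}
```

If the assumptions hold, each spill is triggered by the record accounting area reaching about 80% of its capacity ("record full = true") while only a small fraction of the data area is in use. That points at io.sort.record.percent as a possible tuning knob, but the log alone does not show that spilling is what makes the last mapper slow.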
The URL table has about 60,000,000 rows and the links table has about
210,000,000 rows. Both tables have 20 regions.
The latter is obviously larger. If I add outLinkScan before urldbScan,
will it be faster?
I found that at the beginning, the MapReduce job makes about 3,000
HBase requests per second. But when only the last mapper is still
running, there are fewer than 200 HBase requests per second.
Some MapReduce counters:
Counter                                           Map           Reduce            Total
Map-Reduce Framework
Map output materialized bytes          17,507,954,820                0   17,507,954,820
Map input records                         272,044,972                0      272,044,972
Reduce shuffle bytes                                0   16,417,268,500   16,417,268,500
Spilled Records                           753,880,977      272,044,972    1,025,925,949
Map output bytes                       16,954,700,625                0   16,954,700,625
CPU time spent (ms)                         5,992,170        1,410,260        7,402,430
Total committed heap usage (bytes)      9,461,104,640    7,806,779,392   17,267,884,032
Combine input records                               0                0                0
SPLIT_RAW_BYTES                                 5,876                0            5,876
Reduce input records                                0      272,044,972      272,044,972
Reduce input groups                                 0       40,958,494       40,958,494
Combine output records                              0                0                0
Physical memory (bytes) snapshot       12,862,795,776    8,684,433,408   21,547,229,184
Reduce output records                               0        2,693,146        2,693,146
Virtual memory (bytes) snapshot       119,726,366,720   26,299,596,800  146,025,963,520
Map output records                        272,044,972                0      272,044,972
HBase Counters
REMOTE_RPC_CALLS                              369,863                0          369,863
RPC_CALLS                                   2,720,550                0        2,720,550
RPC_RETRIES                                         0                0                0
NOT_SERVING_REGION_EXCEPTION                        0                0                0
MILLIS_BETWEEN_NEXTS                       22,157,003                0       22,157,003
NUM_SCANNER_RESTARTS                                0                0                0
BYTES_IN_RESULTS                       61,232,515,133                0   61,232,515,133
BYTES_IN_REMOTE_RESULTS                 7,287,035,830                0    7,287,035,830
REMOTE_RPC_RETRIES                                  0                0                0
REGIONS_SCANNED                                    40                0               40
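A few ratios derived from these counters may make them easier to read. The sketch below uses only the figures quoted in this message. The class CounterRatios is hypothetical, and the per-region averages assume rows are spread evenly over the 20 regions of each table, which may not be true for the region the slow mapper reads.

```java
// Hypothetical helper; all inputs are counter values quoted in the post.
public class CounterRatios {
    static double ratio(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        long mapInputRecords = 272_044_972L;
        long rpcCalls = 2_720_550L;
        long mapSpilledRecords = 753_880_977L;
        long mapOutputRecords = 272_044_972L;
        long bytesInResults = 61_232_515_133L;

        // Rows returned per scanner RPC, averaged over all mappers.
        System.out.println("rows per RPC          = " + ratio(mapInputRecords, rpcCalls));
        // How many times each map output record was written to local disk on average.
        System.out.println("spills per record     = " + ratio(mapSpilledRecords, mapOutputRecords));
        // Average size of a scanned row as seen by the client.
        System.out.println("bytes per input row   = " + ratio(bytesInResults, mapInputRecords));
        // Average rows per region, assuming an even spread over 20 regions per table.
        System.out.println("url rows per region   = " + 60_000_000L / 20);   // 3000000
        System.out.println("link rows per region  = " + 210_000_000L / 20);  // 10500000
    }
}
```

The rows-per-RPC figure comes out close to 100, well below the 5000 requested with setCaching, and the map-side Spilled Records count is between two and three times the number of map output records. Both are observations from the counters only; their causes are not established here.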