Posted to issues@hbase.apache.org by "Zheng Hu (JIRA)" <ji...@apache.org> on 2018/12/26 09:10:00 UTC
[jira] [Commented] (HBASE-21642) CopyTable by reading snapshot and bulkloading will save a lot of time.

    [ https://issues.apache.org/jira/browse/HBASE-21642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16728939#comment-16728939 ] 

Zheng Hu commented on HBASE-21642:
----------------------------------

When running CopyTable on a MOB table by scanning a snapshot, I hit the following NullPointerException:
{code}
2018-12-26 16:52:51,088 DEBUG [LocalJobRunner Map Task Executor #0] ipc.AbstractRpcClient(483): Stopping rpc client
2018-12-26 16:52:51,095 WARN  [Thread-1048] mapred.LocalJobRunner$Job(560): job_local2134482229_0002
java.lang.Exception: java.lang.NullPointerException
        at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.hbase.regionserver.HMobStore.readCell(HMobStore.java:409)
        at org.apache.hadoop.hbase.regionserver.HMobStore.resolve(HMobStore.java:346)
        at org.apache.hadoop.hbase.regionserver.MobStoreScanner.next(MobStoreScanner.java:73)
        at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:153)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:6631)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6795)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6568)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:6554)
        at org.apache.hadoop.hbase.client.ClientSideRegionScanner.next(ClientSideRegionScanner.java:77)
        at org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormatImpl$RecordReader.nextKeyValue(TableSnapshotInputFormatImpl.java:241)
        at org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat$TableSnapshotRegionRecordReader.nextKeyValue(TableSnapshotInputFormat.java:166)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:556)
        at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
        at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:243)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
This is a bug in scanning a snapshot of a MOB table...

> CopyTable by reading snapshot and bulkloading will save a lot of time.
> ----------------------------------------------------------------------
>
>                 Key: HBASE-21642
>                 URL: https://issues.apache.org/jira/browse/HBASE-21642
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Zheng Hu
>            Assignee: Zheng Hu
>            Priority: Major
>
> In our HBase clusters, some users need to merge the data of two different tables into one. Currently, CopyTable scans the source table and puts the mutations into the destination table.
> Although CopyTable with bulkload is much faster than CopyTable with scan and put, it still takes a lot of time to scan the source table. Worse, scanning the table through CopyTable impacts the cluster's availability: scanning consumes a lot of RS resources, including CPU, memory, GC stop-the-world pauses, RS handlers, disk I/O and network I/O. All of these affect availability.
> So in our clusters, we have tried to run all scanning jobs by scanning a snapshot instead of the table. This at least isolates the CPU, memory and GC resources of the online RS from those of the scanning job. What's more, snapshot scanning is much faster than scanning through the RS, and it is more stable.
> So here I'll make the CopyTable tool support snapshot scanning.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)