Posted to user@spark.apache.org by Tao Xiao <xi...@gmail.com> on 2014/09/30 04:21:23 UTC

Reading from HBase is too slow

I submitted a job in Yarn-Client mode, which simply reads from an HBase
table containing tens of millions of records and then does a count action.
The job runs for a much longer time than I expected, so I wonder whether it
was because there was too much data to read. Actually, there are 20 nodes in
my Hadoop cluster, so the HBase table does not seem that big (tens of
millions of records).

I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).

BTW, when the job was running, I could see logs on the console, and
specifically I'd like to know what the following log means:

14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
13454 bytes in 0 ms
14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
ms on b04.jsepc.com (progress: 18/86)
14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)


Thanks

Re: Reading from HBase is too slow

Posted by Russ Weeks <rw...@newbrightidea.com>.
Hi, Tao,
When I used newAPIHadoopRDD (Accumulo, not HBase) I found that I had to
specify executor-memory and num-executors explicitly on the command line, or
else I didn't get any parallelism across the cluster.

I used --executor-memory 3G --num-executors 24, but obviously other values
will suit your cluster better.
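
A minimal sketch of such an invocation (spark-submit and these flags exist
from Spark 1.0 onward; the class and jar names below are placeholders only):

  spark-submit --master yarn-client \
    --num-executors 24 \
    --executor-memory 3G \
    --class com.example.ReadTable \
    my-job.jar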

-Russ

On Mon, Sep 29, 2014 at 7:43 PM, Nan Zhu <zh...@gmail.com> wrote:

> can you look at your HBase UI to check whether your job is just reading
> from a single region server?
>
> Best,
>
> --
> Nan Zhu
>
> On Monday, September 29, 2014 at 10:21 PM, Tao Xiao wrote:
>
> I submitted a job in Yarn-Client mode, which simply reads from an HBase
> table containing tens of millions of records and then does a count action.
> The job runs for a much longer time than I expected, so I wonder whether it
> was because there was too much data to read. Actually, there are 20 nodes in
> my Hadoop cluster, so the HBase table does not seem that big (tens of
> millions of records).
>
> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>
> BTW, when the job was running, I could see logs on the console, and
> specifically I'd like to know what the following log means:
>
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
> 13454 bytes in 0 ms
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
> ms on b04.jsepc.com (progress: 18/86)
> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
>
>
> Thanks
>
>
>

Re: Reading from HBase is too slow

Posted by Nan Zhu <zh...@gmail.com>.
can you look at your HBase UI to check whether your job is just reading from a single region server? 

Best, 

-- 
Nan Zhu


On Monday, September 29, 2014 at 10:21 PM, Tao Xiao wrote:

> I submitted a job in Yarn-Client mode, which simply reads from an HBase table containing tens of millions of records and then does a count action. The job runs for a much longer time than I expected, so I wonder whether it was because there was too much data to read. Actually, there are 20 nodes in my Hadoop cluster, so the HBase table does not seem that big (tens of millions of records).
> 
> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
> 
> BTW, when the job was running, I could see logs on the console, and specifically I'd like to know what the following log means:
> 
> > 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
> > 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as 13454 bytes in 0 ms
> > 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426 ms on b04.jsepc.com (progress: 18/86)
> > 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
> > 
> 
> Thanks 


Re: Reading from HBase is too slow

Posted by Tao Xiao <xi...@gmail.com>.
I submitted the job in Yarn-Client mode using the following script:

export SPARK_JAR=/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar

export HADOOP_CLASSPATH=$(hbase classpath)
export CLASSPATH=$CLASSPATH:/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar:/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar:/usr/games/spark/xt/hadoop-common-2.3.0-cdh5.0.1.jar:/usr/games/spark/xt/hbase-client-0.96.1.1-cdh5.0.1.jar:/usr/games/spark/xt/hbase-common-0.96.1.1-cdh5.0.1.jar:/usr/games/spark/xt/hbase-server-0.96.1.1-cdh5.0.1.jar:/usr/games/spark/xt/hbase-protocol-0.96.0-hadoop2.jar:/usr/games/spark/xt/htrace-core-2.01.jar:$HADOOP_CLASSPATH

CONFIG_OPTS="-Dspark.master=yarn-client -Dspark.jars=/usr/games/spark/xt/SparkDemo-0.0.1-SNAPSHOT.jar,/usr/games/spark/xt/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar,/usr/games/spark/xt/hbase-client-0.96.1.1-cdh5.0.1.jar,/usr/games/spark/xt/hbase-common-0.96.1.1-cdh5.0.1.jar,/usr/games/spark/xt/hbase-server-0.96.1.1-cdh5.0.1.jar,/usr/games/spark/xt/hbase-protocol-0.96.0-hadoop2.jar,/usr/games/spark/xt/htrace-core-2.01.jar"

java -cp $CLASSPATH $CONFIG_OPTS com.xt.scala.TestSpark




My job's code is as follows:


import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object TestSpark {
  def main(args: Array[String]) {
    readHBase("C_CONS")
  }

  def readHBase(tableName: String) {
    // Tell TableInputFormat which table to scan.
    val hbaseConf = HBaseConfiguration.create()
    hbaseConf.set(TableInputFormat.INPUT_TABLE, tableName)

    val sparkConf = new SparkConf()
        .setAppName("<<< Reading HBase >>>")
    val sc = new SparkContext(sparkConf)

    // TableInputFormat yields one partition (input split) per table region.
    val rdd = sc.newAPIHadoopRDD(hbaseConf, classOf[TableInputFormat],
               classOf[ImmutableBytesWritable], classOf[Result])

    println(rdd.count)
  }
}


2014-09-30 10:21 GMT+08:00 Tao Xiao <xi...@gmail.com>:

> I submitted a job in Yarn-Client mode, which simply reads from an HBase
> table containing tens of millions of records and then does a count action.
> The job runs for a much longer time than I expected, so I wonder whether it
> was because there was too much data to read. Actually, there are 20 nodes in
> my Hadoop cluster, so the HBase table does not seem that big (tens of
> millions of records).
>
> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>
> BTW, when the job was running, I could see logs on the console, and
> specifically I'd like to know what the following log means:
>
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
> 13454 bytes in 0 ms
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
> ms on b04.jsepc.com (progress: 18/86)
> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
>
>
> Thanks
>

Re: Reading from HBase is too slow

Posted by Tao Xiao <xi...@gmail.com>.
Sean,

I did specify the number of cores to use as follows:

... ...
val sparkConf = new SparkConf()
        .setAppName("<<< Reading HBase >>>")
        .set("spark.cores.max", "32")
val sc = new SparkContext(sparkConf)
... ...



But that does not solve the problem --- only 2 workers are allocated.

I'm using Spark 0.9 and submitting my job in Yarn-Client mode.
Actually, setting spark.cores.max only applies when the job runs on a
standalone deploy cluster or a Mesos cluster in "coarse-grained" sharing
mode. Please refer to this link:
<http://spark.apache.org/docs/0.9.1/configuration.html>

So how to specify the number of executors when submitting a Spark 0.9 job
in Yarn-Client mode?
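
A sketch of what might work, assuming the environment variables documented
for Spark 0.9's yarn-client mode (note that SPARK_WORKER_INSTANCES defaults
to 2, which would explain exactly two workers being allocated):

  export SPARK_WORKER_INSTANCES=20   # number of executors to request (default: 2)
  export SPARK_WORKER_CORES=4        # cores per executor
  export SPARK_WORKER_MEMORY=2g      # memory per executor
  java -cp $CLASSPATH $CONFIG_OPTS com.xt.scala.TestSpark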

2014-10-08 15:09 GMT+08:00 Sean Owen <so...@cloudera.com>:

> You do need to specify the number of executor cores to use. Executors are
> not like mappers. After all, they may do much more in their lifetime than
> just read splits from HBase, so it would not make sense to determine their
> number by something that the first line of the program does.
> On Oct 8, 2014 8:00 AM, "Tao Xiao" <xi...@gmail.com> wrote:
>
>> Hi Sean,
>>
>>    Do I need to specify the number of executors when submitting the job?
>> I suppose the number of executors will be determined by the number of
>> regions of the table. Just as with a MapReduce job, you needn't specify
>> the number of map tasks when reading from an HBase table.
>>
>>   The script to submit my job can be seen in my second post. Please refer
>> to that.
>>
>>
>>
>> 2014-10-08 13:44 GMT+08:00 Sean Owen <so...@cloudera.com>:
>>
>>> How did you run your program? I don't see from your earlier post that
>>> you ever asked for more executors.
>>>
>>> On Wed, Oct 8, 2014 at 4:29 AM, Tao Xiao <xi...@gmail.com>
>>> wrote:
>>> > I found the reason why reading HBase is too slow. Although each
>>> > regionserver serves multiple regions for the table I'm reading, the
>>> > number of Spark workers allocated by Yarn is too low. Actually, I could
>>> > see that the table has dozens of regions spread over about 20
>>> > regionservers, but only two Spark workers are allocated by Yarn. What is
>>> > worse, the two workers run one after the other. So, the Spark job lost
>>> > parallelism.
>>> >
>>> > So now the question is: Why are only 2 workers allocated?
>>>
>>
>>

Re: Reading from HBase is too slow

Posted by Sean Owen <so...@cloudera.com>.
You do need to specify the number of executor cores to use. Executors are
not like mappers. After all, they may do much more in their lifetime than
just read splits from HBase, so it would not make sense to determine their
number by something that the first line of the program does.
On Oct 8, 2014 8:00 AM, "Tao Xiao" <xi...@gmail.com> wrote:

> Hi Sean,
>
>    Do I need to specify the number of executors when submitting the job?
> I suppose the number of executors will be determined by the number of
> regions of the table. Just as with a MapReduce job, you needn't specify
> the number of map tasks when reading from an HBase table.
>
>   The script to submit my job can be seen in my second post. Please refer
> to that.
>
>
>
> 2014-10-08 13:44 GMT+08:00 Sean Owen <so...@cloudera.com>:
>
>> How did you run your program? I don't see from your earlier post that
>> you ever asked for more executors.
>>
>> On Wed, Oct 8, 2014 at 4:29 AM, Tao Xiao <xi...@gmail.com>
>> wrote:
>> > I found the reason why reading HBase is too slow. Although each
>> > regionserver serves multiple regions for the table I'm reading, the
>> > number of Spark workers allocated by Yarn is too low. Actually, I could
>> > see that the table has dozens of regions spread over about 20
>> > regionservers, but only two Spark workers are allocated by Yarn. What is
>> > worse, the two workers run one after the other. So, the Spark job lost
>> > parallelism.
>> >
>> > So now the question is: Why are only 2 workers allocated?
>>
>
>

Re: Reading from HBase is too slow

Posted by Tao Xiao <xi...@gmail.com>.
Hi Sean,

   Do I need to specify the number of executors when submitting the job? I
suppose the number of executors will be determined by the number of regions
of the table. Just as with a MapReduce job, you needn't specify the number
of map tasks when reading from an HBase table.

  The script to submit my job can be seen in my second post. Please refer
to that.



2014-10-08 13:44 GMT+08:00 Sean Owen <so...@cloudera.com>:

> How did you run your program? I don't see from your earlier post that
> you ever asked for more executors.
>
> On Wed, Oct 8, 2014 at 4:29 AM, Tao Xiao <xi...@gmail.com> wrote:
> > I found the reason why reading HBase is too slow. Although each
> > regionserver serves multiple regions for the table I'm reading, the
> > number of Spark workers allocated by Yarn is too low. Actually, I could
> > see that the table has dozens of regions spread over about 20
> > regionservers, but only two Spark workers are allocated by Yarn. What is
> > worse, the two workers run one after the other. So, the Spark job lost
> > parallelism.
> >
> > So now the question is: Why are only 2 workers allocated?
>

Re: Reading from HBase is too slow

Posted by Sean Owen <so...@cloudera.com>.
How did you run your program? I don't see from your earlier post that
you ever asked for more executors.

On Wed, Oct 8, 2014 at 4:29 AM, Tao Xiao <xi...@gmail.com> wrote:
> I found the reason why reading HBase is too slow. Although each
> regionserver serves multiple regions for the table I'm reading, the number
> of Spark workers allocated by Yarn is too low. Actually, I could see that
> the table has dozens of regions spread over about 20 regionservers, but
> only two Spark workers are allocated by Yarn. What is worse, the two
> workers run one after the other. So, the Spark job lost parallelism.
>
> So now the question is: Why are only 2 workers allocated?



Re: Reading from HBase is too slow

Posted by Tao Xiao <xi...@gmail.com>.
I found the reason why reading HBase is too slow. Although each
regionserver serves multiple regions for the table I'm reading, the number
of Spark workers allocated by Yarn is too low. Actually, I could see that
the table has dozens of regions spread over about 20 regionservers, but
only two Spark workers are allocated by Yarn. What is worse, the two
workers run one after the other. So, the Spark job lost parallelism.

So now the question is: Why are only 2 workers allocated?

The following is the log info in the ApplicationMaster log UI, and we can
see that only 2 workers are allocated, on two nodes (a04.jsepc.com and
b06.jsepc.com):

14/10/08 09:55:16 INFO yarn.WorkerLauncher: ApplicationAttemptId:
appattempt_1412731028648_0157_000001
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Registering the
ApplicationMaster
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Waiting for Spark driver to be
reachable.
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Driver now available:
a04.jsepc.com:56888
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Listen to driver: akka.tcp://
spark@a04.jsepc.com:56888/user/CoarseGrainedScheduler
14/10/08 09:55:16 INFO yarn.WorkerLauncher: Allocating 2 workers.
14/10/08 09:55:16 INFO yarn.YarnAllocationHandler: Will Allocate 2 worker
containers, each with 1408 memory
14/10/08 09:55:16 INFO yarn.YarnAllocationHandler: Container request (host:
Any, priority: 1, capability: <memory:1408, vCores:1>
14/10/08 09:55:16 INFO yarn.YarnAllocationHandler: Container request (host:
Any, priority: 1, capability: <memory:1408, vCores:1>
14/10/08 09:55:20 INFO util.RackResolver: Resolved a04.jsepc.com to /rack1
14/10/08 09:55:20 INFO util.RackResolver: Resolved b06.jsepc.com to /rack2
14/10/08 09:55:20 INFO yarn.YarnAllocationHandler: Launching container
container_1412731028648_0157_01_000002 for on host a04.jsepc.com
14/10/08 09:55:20 INFO yarn.YarnAllocationHandler: Launching
WorkerRunnable. driverUrl: akka.tcp://
spark@a04.jsepc.com:56888/user/CoarseGrainedScheduler,  workerHostname:
a04.jsepc.com
14/10/08 09:55:21 INFO yarn.YarnAllocationHandler: Launching container
container_1412731028648_0157_01_000003 for on host b06.jsepc.com
14/10/08 09:55:21 INFO yarn.YarnAllocationHandler: Launching
WorkerRunnable. driverUrl: akka.tcp://
spark@a04.jsepc.com:56888/user/CoarseGrainedScheduler,  workerHostname:
b06.jsepc.com
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Starting Worker Container
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Starting Worker Container
14/10/08 09:55:21 INFO impl.ContainerManagementProtocolProxy:
yarn.client.max-nodemanagers-proxies : 500
14/10/08 09:55:21 INFO impl.ContainerManagementProtocolProxy:
yarn.client.max-nodemanagers-proxies : 500
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Setting up
ContainerLaunchContext
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Setting up
ContainerLaunchContext
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Preparing Local resources
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Preparing Local resources
14/10/08 09:55:21 INFO yarn.WorkerLauncher: All workers have launched.
14/10/08 09:55:21 INFO yarn.WorkerLauncher: Started progress reporter
thread - sleep time : 5000
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Prepared Local resources
Map(spark.jar -> resource { scheme: "hdfs" host: "jsepc-ns" port: -1 file:
"/user/root/.sparkStaging/application_1412731028648_0157/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar"
} size: 75288668 timestamp: 1412733307395 type: FILE visibility: PRIVATE)
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Prepared Local resources
Map(spark.jar -> resource { scheme: "hdfs" host: "jsepc-ns" port: -1 file:
"/user/root/.sparkStaging/application_1412731028648_0157/spark-assembly_2.10-0.9.0-cdh5.0.1-hadoop2.3.0-cdh5.0.1.jar"
} size: 75288668 timestamp: 1412733307395 type: FILE visibility: PRIVATE)
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Setting up worker with
commands: List($JAVA_HOME/bin/java -server  -XX:OnOutOfMemoryError='kill
%p' -Xms1024m -Xmx1024m  -Djava.io.tmpdir=$PWD/tmp
 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://
spark@a04.jsepc.com:56888/user/CoarseGrainedScheduler 2 b06.jsepc.com 1 1>
<LOG_DIR>/stdout 2> <LOG_DIR>/stderr)
14/10/08 09:55:21 INFO yarn.WorkerRunnable: Setting up worker with
commands: List($JAVA_HOME/bin/java -server  -XX:OnOutOfMemoryError='kill
%p' -Xms1024m -Xmx1024m  -Djava.io.tmpdir=$PWD/tmp
 org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://
spark@a04.jsepc.com:56888/user/CoarseGrainedScheduler 1 a04.jsepc.com 1 1>
<LOG_DIR>/stdout 2> <LOG_DIR>/stderr)
14/10/08 09:55:21 INFO impl.ContainerManagementProtocolProxy: Opening
proxy : a04.jsepc.com:8041
14/10/08 09:55:21 INFO impl.ContainerManagementProtocolProxy: Opening
proxy : b06.jsepc.com:8041


Here <http://pastebin.com/VhfmHPQe> is the log printed on the console while
the Spark job is running.



2014-10-02 0:58 GMT+08:00 Vladimir Rodionov <vr...@splicemachine.com>:

> Yes, it's in 0.98. CDH is free (w/o subscription) and sometimes it's worth
> upgrading to the latest version (which is 0.98-based).
>
> -Vladimir Rodionov
>
> On Wed, Oct 1, 2014 at 9:52 AM, Ted Yu <yu...@gmail.com> wrote:
>
>> As far as I know, that feature is not in CDH 5.0.0
>>
>> FYI
>>
>> On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov <
>> vrodionov@splicemachine.com> wrote:
>>
>>> Using TableInputFormat is not the fastest way of reading data from
>>> HBase. Do not expect 100s of MB per sec. You probably should take a look at
>>> M/R over HBase snapshots.
>>>
>>> https://issues.apache.org/jira/browse/HBASE-8369
>>>
>>> -Vladimir Rodionov
>>>
>>> On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao <xi...@gmail.com>
>>> wrote:
>>>
>>>> I can submit a MapReduce job reading that table, although its
>>>> processing rate is also a little slower than I expected, but not as slow
>>>> as Spark.
>>>>
>>>>
>>>>
>>
>

Re: Reading from HBase is too slow

Posted by Vladimir Rodionov <vr...@splicemachine.com>.
Yes, it's in 0.98. CDH is free (w/o subscription) and sometimes it's worth
upgrading to the latest version (which is 0.98-based).

-Vladimir Rodionov

On Wed, Oct 1, 2014 at 9:52 AM, Ted Yu <yu...@gmail.com> wrote:

> As far as I know, that feature is not in CDH 5.0.0
>
> FYI
>
> On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov <
> vrodionov@splicemachine.com> wrote:
>
>> Using TableInputFormat is not the fastest way of reading data from HBase.
>> Do not expect 100s of MB per sec. You probably should take a look at M/R
>> over HBase snapshots.
>>
>> https://issues.apache.org/jira/browse/HBASE-8369
>>
>> -Vladimir Rodionov
>>
>> On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao <xi...@gmail.com>
>> wrote:
>>
>>> I can submit a MapReduce job reading that table, although its processing
>>> rate is also a little slower than I expected, but not as slow as Spark.
>>>
>>>
>>>
>

Re: Reading from HBase is too slow

Posted by Ted Yu <yu...@gmail.com>.
As far as I know, that feature is not in CDH 5.0.0

FYI

On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov <
vrodionov@splicemachine.com> wrote:

> Using TableInputFormat is not the fastest way of reading data from HBase.
> Do not expect 100s of MB per sec. You probably should take a look at M/R
> over HBase snapshots.
>
> https://issues.apache.org/jira/browse/HBASE-8369
>
> -Vladimir Rodionov
>
> On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao <xi...@gmail.com> wrote:
>
>> I can submit a MapReduce job reading that table, although its processing
>> rate is also a little slower than I expected, but not as slow as Spark.
>>
>>
>>

Re: Reading from HBase is too slow

Posted by Vladimir Rodionov <vr...@splicemachine.com>.
Using TableInputFormat is not the fastest way of reading data from HBase.
Do not expect 100s of MB per sec. You probably should take a look at M/R
over HBase snapshots.

https://issues.apache.org/jira/browse/HBASE-8369
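
For Spark, a rough sketch of what reading from a snapshot could look like
(this assumes HBase 0.98+, where HBASE-8369 added TableSnapshotInputFormat,
an existing SparkContext sc, a snapshot named "C_CONS_snap" taken
beforehand, and a scratch restore directory on HDFS):

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hbase.HBaseConfiguration
  import org.apache.hadoop.hbase.client.Result
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.TableSnapshotInputFormat
  import org.apache.hadoop.mapreduce.Job

  // Reads the snapshot's HFiles directly from HDFS, bypassing region servers.
  val job = Job.getInstance(HBaseConfiguration.create())
  TableSnapshotInputFormat.setInput(job, "C_CONS_snap", new Path("/tmp/snapshot_restore"))

  val rdd = sc.newAPIHadoopRDD(job.getConfiguration,
    classOf[TableSnapshotInputFormat],
    classOf[ImmutableBytesWritable], classOf[Result])
  println(rdd.count)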

-Vladimir Rodionov

On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao <xi...@gmail.com> wrote:

> I can submit a MapReduce job reading that table, although its processing
> rate is also a little slower than I expected, but not as slow as Spark.
>
>
>

Re: Reading from HBase is too slow

Posted by Tao Xiao <xi...@gmail.com>.
I can submit a MapReduce job reading that table, although its processing
rate is also a little slower than I expected, but not as slow as Spark.

2014-10-01 12:04 GMT+08:00 Ted Yu <yu...@gmail.com>:

> Can you launch a job which exercises TableInputFormat on the same table
> without using Spark?
>
> This would show whether the slowdown is in HBase code or somewhere else.
>
> Cheers
>
> On Mon, Sep 29, 2014 at 11:40 PM, Tao Xiao <xi...@gmail.com>
> wrote:
>
>> I checked HBase UI. Well, this table is not completely evenly spread
>> across the nodes, but I think to some extent it can be seen as nearly
>> evenly spread - at least there is not a single node which has too many
>> regions.  Here is a screenshot of HBase UI
>> <http://imgbin.org/index.php?page=image&id=19539>.
>>
>> Besides, I checked the size of each region in bytes for this table from
>> the command line as follows:
>>
>>
>> -bash-4.1$ hadoop dfs -du -h /hbase/data/default/C_CONS
>> DEPRECATED: Use of this script to execute hdfs command is deprecated.
>> Instead use the hdfs command for it.
>>
>> 288      /hbase/data/default/C_CONS/.tabledesc
>> 0        /hbase/data/default/C_CONS/.tmp
>> 159.6 M  /hbase/data/default/C_CONS/0008c2494a5399d68495d9c8ae147821
>> 76.7 M   /hbase/data/default/C_CONS/021d7d21d7faeb7b2a77835d6f86747e
>> 81.3 M   /hbase/data/default/C_CONS/02a39a316ac6d2bda89e72e74aa18a6e
>> 155.3 M  /hbase/data/default/C_CONS/02fe51bc077290febc85651d8ee31abc
>> 173.4 M  /hbase/data/default/C_CONS/045859bcc70e36eb4d33f8ca3b7d9633
>> 82.6 M   /hbase/data/default/C_CONS/05c868b6036cc4f1836f70be6215c851
>> 74.1 M   /hbase/data/default/C_CONS/0816378c837f1f3b84f4d4060d22beb3
>> 84.7 M   /hbase/data/default/C_CONS/083da8f5eb8a5b1cca76376449f357ca
>> 346.6 M  /hbase/data/default/C_CONS/0ac70fcb1baea0896ea069a6bcc30898
>> 333.8 M  /hbase/data/default/C_CONS/0b3be845bd4f5e958e8c9a18c8eaab21
>> 72.7 M   /hbase/data/default/C_CONS/12c13610c50dbc8ab27f20b0ebf2bfc4
>> 76.1 M   /hbase/data/default/C_CONS/1341966315d7e53be719d948d595bee0
>> 72.4 M   /hbase/data/default/C_CONS/1acdbc05c502b11da4852a1f21228f44
>> 70.0 M   /hbase/data/default/C_CONS/1b8f57d65f6c0e4de721e4c8f1944829
>> 183.9 M  /hbase/data/default/C_CONS/1f1ae7ca9f725fcf9639a4d52086fa50
>> 65.5 M   /hbase/data/default/C_CONS/20c10b96e2b9c40684aaeb6d0cfbf7c0
>> 76.0 M   /hbase/data/default/C_CONS/22515194fe09adcd4cbb2f5307303c73
>> 78.4 M   /hbase/data/default/C_CONS/236cd80393cb5b7c526bd2c45ce53a0a
>> 150.0 M  /hbase/data/default/C_CONS/23bd80852f47b97b4122709ec844d4ed
>> 81.6 M   /hbase/data/default/C_CONS/241b8bc415029dedf94c4a84e6c4ad3b
>> 77.9 M   /hbase/data/default/C_CONS/27f1e59bde75ef3096a5bdd3eb402cd7
>> 160.8 M  /hbase/data/default/C_CONS/30c2ae3be38b8cdf3b337054a7d61478
>> 372.2 M  /hbase/data/default/C_CONS/31d606da71b35844d0cdc8a195c97d2e
>> 182.6 M  /hbase/data/default/C_CONS/3274a022bc7419d426cf63caa1cc88e1
>> 92.1 M   /hbase/data/default/C_CONS/344faae7971d87b51edf23f75a7c3746
>> 154.7 M  /hbase/data/default/C_CONS/3b3f0c839bdb32ed2104f67c8a02da41
>> 77.4 M   /hbase/data/default/C_CONS/3cf6b2bd0cfe85f3111d0ba1b84a60b4
>> 71.5 M   /hbase/data/default/C_CONS/3f466db078d07e2ddddbfb11c681e0e3
>> 77.8 M   /hbase/data/default/C_CONS/3f8c1b7dec05118eb9894bb591e32b2f
>> 83.6 M   /hbase/data/default/C_CONS/45e105856fcb54748c48bd45e973a3b9
>> 185.2 M  /hbase/data/default/C_CONS/4becd90d46a2d4a6bd8ecbe02b60892c
>> 165.6 M  /hbase/data/default/C_CONS/4dcebd58c7013062c4a8583012a11b5a
>> 67.3 M   /hbase/data/default/C_CONS/51f845d842605dda66b1ae01ad8a17e8
>> 148.2 M  /hbase/data/default/C_CONS/532189155ab78dbd1e36aac3ab4878a8
>> 172.6 M  /hbase/data/default/C_CONS/5401d9cb19adb9bd78718ea047e6d9d7
>> 139.4 M  /hbase/data/default/C_CONS/547d2a8c54aae73e8f12b4570efd984c
>> 89.5 M   /hbase/data/default/C_CONS/54cbac1f71c7781697052bb2aa1c5a18
>> 101.3 M  /hbase/data/default/C_CONS/55263ce293327683b9c6e6098ec3e89a
>> 85.2 M   /hbase/data/default/C_CONS/55f8c278e35de6bca5083c7a66e355fb
>> 85.8 M   /hbase/data/default/C_CONS/57112558912e1de016327e115bc84f11
>> 171.8 M  /hbase/data/default/C_CONS/572b886cbfe92ddcb97502f041953fb8
>> 51       /hbase/data/default/C_CONS/6bd64d8cf6b38806731f7693bdd673c9
>> 86.6 M   /hbase/data/default/C_CONS/7695703b7b527afc5f3524eee9b5d806
>> 74.8 M   /hbase/data/default/C_CONS/7bb7567685f5e16a4379d7cf79de2ecc
>> 120.1 M  /hbase/data/default/C_CONS/7c144bef991bb3c959d7ef6e2fa5036a
>> 166.0 M  /hbase/data/default/C_CONS/7c7817eb3e531d5bda88b5f0de6a20de
>> 173.5 M  /hbase/data/default/C_CONS/7d07c139575d007ecbb23fa946e39130
>> 139.2 M  /hbase/data/default/C_CONS/8295aa701110ddf4055e8c3ca5bd9cad
>> 91.7 M   /hbase/data/default/C_CONS/84b340d22471580ed8100d6614668eb1
>> 81.2 M   /hbase/data/default/C_CONS/8605f4470498a01a5ec4c88e7ea8a458
>> 78.3 M   /hbase/data/default/C_CONS/897da8e33275b80926ef38200132f819
>> 234.4 M  /hbase/data/default/C_CONS/93f5ce30ed8e54cc282cb5b88fa28d76
>> 126.3 M  /hbase/data/default/C_CONS/96dd1decd62e35c394bb8e7f6095f054
>> 80.9 M   /hbase/data/default/C_CONS/998364405e57a7eedae094bca76a419e
>> 184.8 M  /hbase/data/default/C_CONS/9df3b62b1bff59b67b75ad86d694b8c8
>> 126.6 M  /hbase/data/default/C_CONS/a4531e06f3440349e7e6776b8bfedaf0
>> 79.3 M   /hbase/data/default/C_CONS/aa0b8341d3ca925ed24309f46e0ab845
>> 79.9 M   /hbase/data/default/C_CONS/aa45bfa549a439ded2a8b159a5c9caaa
>> 84.9 M   /hbase/data/default/C_CONS/abae60b33de2999698a7452ff62dad08
>> 87.0 M   /hbase/data/default/C_CONS/ac5ff05785bc6e07637106450c74d02a
>> 80.7 M   /hbase/data/default/C_CONS/aca765b578b236978b11ec26c167a958
>> 68.0 M   /hbase/data/default/C_CONS/b03614566cc8d521a9c983d418b57866
>> 77.4 M   /hbase/data/default/C_CONS/b1ae0451f592b28eed8a58908f91293a
>> 91.5 M   /hbase/data/default/C_CONS/b8396049e2b742108add1485c0eb4aeb
>> 81.2 M   /hbase/data/default/C_CONS/b8d25b3e536b4fea5ee4ee2b21885c76
>> 87.8 M   /hbase/data/default/C_CONS/bbfbe319705df23a23a89b40e52d89a8
>> 81.3 M   /hbase/data/default/C_CONS/bccaeedc65d9295289f78aaec588cc3d
>> 95.8 M   /hbase/data/default/C_CONS/c229d583958802571dfaa9a39453df0d
>> 88.5 M   /hbase/data/default/C_CONS/c9d7a038243d1b3e2448a48007f1f9e0
>> 158.8 M  /hbase/data/default/C_CONS/cca1bf1f013724af25d71ad4310e5d4a
>> 212.8 M  /hbase/data/default/C_CONS/ccabf798734aa8e05798c43c132ad565
>> 85.1 M   /hbase/data/default/C_CONS/d1cb54346e109b1ba76fd95aa4540161
>> 84.4 M   /hbase/data/default/C_CONS/d4dd8c3fa81b751892689cc92a96aa99
>> 139.5 M  /hbase/data/default/C_CONS/dc15ceeed21474b51086f3103cbd0074
>> 97.7 M   /hbase/data/default/C_CONS/df20e2077f22e83ecd8e55550d52dea1
>> 221.0 M  /hbase/data/default/C_CONS/e30d0d55e0887a676c8b79e03771ad23
>> 75.7 M   /hbase/data/default/C_CONS/e6ed24ce0b3e1e903bd9757d28380f3a
>> 74.9 M   /hbase/data/default/C_CONS/e9732d9905f5373fb0fd7a1ce033e17b
>> 101.2 M  /hbase/data/default/C_CONS/f2a49dbaf018f0e45bbd7a758f123418
>> 172.6 M  /hbase/data/default/C_CONS/f34645de36d3c1413ce83177e2118947
>> 89.2 M   /hbase/data/default/C_CONS/f3db2bf3b7ffb7b4c0029eac5d631bdb
>> 81.6 M   /hbase/data/default/C_CONS/f43b49c4f384853266e9ee45a98104a6
>> 68.9 M   /hbase/data/default/C_CONS/fa4fb0047ec98fb10bf84fd72937f415
>> 86.7 M   /hbase/data/default/C_CONS/fc69f349655676e046c9110550825f5a
>> 155.0 M  /hbase/data/default/C_CONS/feb0835bdf73c257de11c65f18b1330d
>> 75.2 M   /hbase/data/default/C_CONS/fff9fbe56af8b9e0e00826f8936e7a56
>>
>>
>>
>> From the result above we can see that the biggest region's size is 346.6
>> M, while most other regions' sizes are close to one another.
>>
>> So what may be the real reason?
>>
>> 2014-09-30 12:17 GMT+08:00 Vladimir Rodionov <vrodionov@splicemachine.com>:
>>
>>> HBase TableInputFormat creates one input split per region. You cannot
>>> achieve a high level of parallelism unless you have at least 5-10 regions
>>> per RS. What does that mean? You probably have too few regions. You can
>>> verify that in the HBase Web UI.
>>>
>>> -Vladimir Rodionov
>>>
>>> On Mon, Sep 29, 2014 at 7:21 PM, Tao Xiao <xi...@gmail.com>
>>> wrote:
>>>
>>>> I submitted a job in Yarn-Client mode, which simply reads from an HBase
>>>> table containing tens of millions of records and then does a count action.
>>>> The job runs for a much longer time than I expected, so I wonder whether it
>>>> was because there was too much data to read. Actually, there are 20 nodes in
>>>> my Hadoop cluster, so the HBase table does not seem that big (tens of
>>>> millions of records).
>>>>
>>>> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>>>>
>>>> BTW, when the job was running, I could see logs on the console, and
>>>> specifically I'd like to know what the following log means:
>>>>
>>>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20
>>>> as TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
>>>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20
>>>> as 13454 bytes in 0 ms
>>>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in
>>>> 16426 ms on b04.jsepc.com (progress: 18/86)
>>>> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0,
>>>> 19)
>>>>
>>>>
>>>> Thanks
>>>>
>>>
>>>
>>
>

Re: Reading from HBase is too slow

Posted by Ted Yu <yu...@gmail.com>.
Can you launch a job which exercises TableInputFormat on the same table
without using Spark?

This would show whether the slowdown is in HBase code or somewhere else.
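
For example, HBase ships a MapReduce RowCounter job that reads through
TableInputFormat; something like the following (run from a node with the
HBase client configured) would exercise the same read path without Spark:

  hbase org.apache.hadoop.hbase.mapreduce.RowCounter C_CONS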

Cheers

On Mon, Sep 29, 2014 at 11:40 PM, Tao Xiao <xi...@gmail.com> wrote:

> I checked HBase UI. Well, this table is not completely evenly spread
> across the nodes, but I think to some extent it can be seen as nearly
> evenly spread - at least there is not a single node which has too many
> regions.  Here is a screenshot of HBase UI
> <http://imgbin.org/index.php?page=image&id=19539>.
>
> Besides, I checked the size of each region in bytes for this table from
> the command line as follows:
>
>
> -bash-4.1$ hadoop dfs -du -h /hbase/data/default/C_CONS
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
>
> 288      /hbase/data/default/C_CONS/.tabledesc
> 0        /hbase/data/default/C_CONS/.tmp
> 159.6 M  /hbase/data/default/C_CONS/0008c2494a5399d68495d9c8ae147821
> 76.7 M   /hbase/data/default/C_CONS/021d7d21d7faeb7b2a77835d6f86747e
> 81.3 M   /hbase/data/default/C_CONS/02a39a316ac6d2bda89e72e74aa18a6e
> 155.3 M  /hbase/data/default/C_CONS/02fe51bc077290febc85651d8ee31abc
> 173.4 M  /hbase/data/default/C_CONS/045859bcc70e36eb4d33f8ca3b7d9633
> 82.6 M   /hbase/data/default/C_CONS/05c868b6036cc4f1836f70be6215c851
> 74.1 M   /hbase/data/default/C_CONS/0816378c837f1f3b84f4d4060d22beb3
> 84.7 M   /hbase/data/default/C_CONS/083da8f5eb8a5b1cca76376449f357ca
> 346.6 M  /hbase/data/default/C_CONS/0ac70fcb1baea0896ea069a6bcc30898
> 333.8 M  /hbase/data/default/C_CONS/0b3be845bd4f5e958e8c9a18c8eaab21
> 72.7 M   /hbase/data/default/C_CONS/12c13610c50dbc8ab27f20b0ebf2bfc4
> 76.1 M   /hbase/data/default/C_CONS/1341966315d7e53be719d948d595bee0
> 72.4 M   /hbase/data/default/C_CONS/1acdbc05c502b11da4852a1f21228f44
> 70.0 M   /hbase/data/default/C_CONS/1b8f57d65f6c0e4de721e4c8f1944829
> 183.9 M  /hbase/data/default/C_CONS/1f1ae7ca9f725fcf9639a4d52086fa50
> 65.5 M   /hbase/data/default/C_CONS/20c10b96e2b9c40684aaeb6d0cfbf7c0
> 76.0 M   /hbase/data/default/C_CONS/22515194fe09adcd4cbb2f5307303c73
> 78.4 M   /hbase/data/default/C_CONS/236cd80393cb5b7c526bd2c45ce53a0a
> 150.0 M  /hbase/data/default/C_CONS/23bd80852f47b97b4122709ec844d4ed
> 81.6 M   /hbase/data/default/C_CONS/241b8bc415029dedf94c4a84e6c4ad3b
> 77.9 M   /hbase/data/default/C_CONS/27f1e59bde75ef3096a5bdd3eb402cd7
> 160.8 M  /hbase/data/default/C_CONS/30c2ae3be38b8cdf3b337054a7d61478
> 372.2 M  /hbase/data/default/C_CONS/31d606da71b35844d0cdc8a195c97d2e
> 182.6 M  /hbase/data/default/C_CONS/3274a022bc7419d426cf63caa1cc88e1
> 92.1 M   /hbase/data/default/C_CONS/344faae7971d87b51edf23f75a7c3746
> 154.7 M  /hbase/data/default/C_CONS/3b3f0c839bdb32ed2104f67c8a02da41
> 77.4 M   /hbase/data/default/C_CONS/3cf6b2bd0cfe85f3111d0ba1b84a60b4
> 71.5 M   /hbase/data/default/C_CONS/3f466db078d07e2ddddbfb11c681e0e3
> 77.8 M   /hbase/data/default/C_CONS/3f8c1b7dec05118eb9894bb591e32b2f
> 83.6 M   /hbase/data/default/C_CONS/45e105856fcb54748c48bd45e973a3b9
> 185.2 M  /hbase/data/default/C_CONS/4becd90d46a2d4a6bd8ecbe02b60892c
> 165.6 M  /hbase/data/default/C_CONS/4dcebd58c7013062c4a8583012a11b5a
> 67.3 M   /hbase/data/default/C_CONS/51f845d842605dda66b1ae01ad8a17e8
> 148.2 M  /hbase/data/default/C_CONS/532189155ab78dbd1e36aac3ab4878a8
> 172.6 M  /hbase/data/default/C_CONS/5401d9cb19adb9bd78718ea047e6d9d7
> 139.4 M  /hbase/data/default/C_CONS/547d2a8c54aae73e8f12b4570efd984c
> 89.5 M   /hbase/data/default/C_CONS/54cbac1f71c7781697052bb2aa1c5a18
> 101.3 M  /hbase/data/default/C_CONS/55263ce293327683b9c6e6098ec3e89a
> 85.2 M   /hbase/data/default/C_CONS/55f8c278e35de6bca5083c7a66e355fb
> 85.8 M   /hbase/data/default/C_CONS/57112558912e1de016327e115bc84f11
> 171.8 M  /hbase/data/default/C_CONS/572b886cbfe92ddcb97502f041953fb8
> 51       /hbase/data/default/C_CONS/6bd64d8cf6b38806731f7693bdd673c9
> 86.6 M   /hbase/data/default/C_CONS/7695703b7b527afc5f3524eee9b5d806
> 74.8 M   /hbase/data/default/C_CONS/7bb7567685f5e16a4379d7cf79de2ecc
> 120.1 M  /hbase/data/default/C_CONS/7c144bef991bb3c959d7ef6e2fa5036a
> 166.0 M  /hbase/data/default/C_CONS/7c7817eb3e531d5bda88b5f0de6a20de
> 173.5 M  /hbase/data/default/C_CONS/7d07c139575d007ecbb23fa946e39130
> 139.2 M  /hbase/data/default/C_CONS/8295aa701110ddf4055e8c3ca5bd9cad
> 91.7 M   /hbase/data/default/C_CONS/84b340d22471580ed8100d6614668eb1
> 81.2 M   /hbase/data/default/C_CONS/8605f4470498a01a5ec4c88e7ea8a458
> 78.3 M   /hbase/data/default/C_CONS/897da8e33275b80926ef38200132f819
> 234.4 M  /hbase/data/default/C_CONS/93f5ce30ed8e54cc282cb5b88fa28d76
> 126.3 M  /hbase/data/default/C_CONS/96dd1decd62e35c394bb8e7f6095f054
> 80.9 M   /hbase/data/default/C_CONS/998364405e57a7eedae094bca76a419e
> 184.8 M  /hbase/data/default/C_CONS/9df3b62b1bff59b67b75ad86d694b8c8
> 126.6 M  /hbase/data/default/C_CONS/a4531e06f3440349e7e6776b8bfedaf0
> 79.3 M   /hbase/data/default/C_CONS/aa0b8341d3ca925ed24309f46e0ab845
> 79.9 M   /hbase/data/default/C_CONS/aa45bfa549a439ded2a8b159a5c9caaa
> 84.9 M   /hbase/data/default/C_CONS/abae60b33de2999698a7452ff62dad08
> 87.0 M   /hbase/data/default/C_CONS/ac5ff05785bc6e07637106450c74d02a
> 80.7 M   /hbase/data/default/C_CONS/aca765b578b236978b11ec26c167a958
> 68.0 M   /hbase/data/default/C_CONS/b03614566cc8d521a9c983d418b57866
> 77.4 M   /hbase/data/default/C_CONS/b1ae0451f592b28eed8a58908f91293a
> 91.5 M   /hbase/data/default/C_CONS/b8396049e2b742108add1485c0eb4aeb
> 81.2 M   /hbase/data/default/C_CONS/b8d25b3e536b4fea5ee4ee2b21885c76
> 87.8 M   /hbase/data/default/C_CONS/bbfbe319705df23a23a89b40e52d89a8
> 81.3 M   /hbase/data/default/C_CONS/bccaeedc65d9295289f78aaec588cc3d
> 95.8 M   /hbase/data/default/C_CONS/c229d583958802571dfaa9a39453df0d
> 88.5 M   /hbase/data/default/C_CONS/c9d7a038243d1b3e2448a48007f1f9e0
> 158.8 M  /hbase/data/default/C_CONS/cca1bf1f013724af25d71ad4310e5d4a
> 212.8 M  /hbase/data/default/C_CONS/ccabf798734aa8e05798c43c132ad565
> 85.1 M   /hbase/data/default/C_CONS/d1cb54346e109b1ba76fd95aa4540161
> 84.4 M   /hbase/data/default/C_CONS/d4dd8c3fa81b751892689cc92a96aa99
> 139.5 M  /hbase/data/default/C_CONS/dc15ceeed21474b51086f3103cbd0074
> 97.7 M   /hbase/data/default/C_CONS/df20e2077f22e83ecd8e55550d52dea1
> 221.0 M  /hbase/data/default/C_CONS/e30d0d55e0887a676c8b79e03771ad23
> 75.7 M   /hbase/data/default/C_CONS/e6ed24ce0b3e1e903bd9757d28380f3a
> 74.9 M   /hbase/data/default/C_CONS/e9732d9905f5373fb0fd7a1ce033e17b
> 101.2 M  /hbase/data/default/C_CONS/f2a49dbaf018f0e45bbd7a758f123418
> 172.6 M  /hbase/data/default/C_CONS/f34645de36d3c1413ce83177e2118947
> 89.2 M   /hbase/data/default/C_CONS/f3db2bf3b7ffb7b4c0029eac5d631bdb
> 81.6 M   /hbase/data/default/C_CONS/f43b49c4f384853266e9ee45a98104a6
> 68.9 M   /hbase/data/default/C_CONS/fa4fb0047ec98fb10bf84fd72937f415
> 86.7 M   /hbase/data/default/C_CONS/fc69f349655676e046c9110550825f5a
> 155.0 M  /hbase/data/default/C_CONS/feb0835bdf73c257de11c65f18b1330d
> 75.2 M   /hbase/data/default/C_CONS/fff9fbe56af8b9e0e00826f8936e7a56
>
>
>
> From the result above we can see that the biggest region's size is 346.6
> M, while most other regions' sizes are close to one another.
>
> So what may be the real reason?
>
> 2014-09-30 12:17 GMT+08:00 Vladimir Rodionov <vr...@splicemachine.com>:
>
>> HBase TableInputFormat creates one input split per region. You cannot
>> achieve a high level of parallelism unless you have at least 5-10 regions
>> per RS. What does that mean? You probably have too few regions. You can
>> verify that in the HBase Web UI.
>>
>> -Vladimir Rodionov
>>
>> On Mon, Sep 29, 2014 at 7:21 PM, Tao Xiao <xi...@gmail.com>
>> wrote:
>>
>>> I submitted a job in Yarn-Client mode, which simply reads from an HBase
>>> table containing tens of millions of records and then does a count action.
>>> The job runs for a much longer time than I expected, so I wonder whether it
>>> was because there was too much data to read. Actually, there are 20 nodes in
>>> my Hadoop cluster, so the HBase table does not seem that big (tens of
>>> millions of records).
>>>
>>> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>>>
>>> BTW, when the job was running, I could see logs on the console, and
>>> specifically I'd like to know what the following log means:
>>>
>>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
>>> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
>>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20
>>> as 13454 bytes in 0 ms
>>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in
>>> 16426 ms on b04.jsepc.com (progress: 18/86)
>>> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0,
>>> 19)
>>>
>>>
>>> Thanks
>>>
>>
>>
>

Re: Reading from HBase is too slow

Posted by Tao Xiao <xi...@gmail.com>.
I checked HBase UI. Well, this table is not completely evenly spread across
the nodes, but I think to some extent it can be seen as nearly evenly
spread - at least there is not a single node which has too many regions.
Here is a screenshot of HBase UI
<http://imgbin.org/index.php?page=image&id=19539>.

Besides, I checked the size of each region in bytes for this table from the
command line as follows:


-bash-4.1$ hadoop dfs -du -h /hbase/data/default/C_CONS
DEPRECATED: Use of this script to execute hdfs command is deprecated.
Instead use the hdfs command for it.

288      /hbase/data/default/C_CONS/.tabledesc
0        /hbase/data/default/C_CONS/.tmp
159.6 M  /hbase/data/default/C_CONS/0008c2494a5399d68495d9c8ae147821
76.7 M   /hbase/data/default/C_CONS/021d7d21d7faeb7b2a77835d6f86747e
81.3 M   /hbase/data/default/C_CONS/02a39a316ac6d2bda89e72e74aa18a6e
155.3 M  /hbase/data/default/C_CONS/02fe51bc077290febc85651d8ee31abc
173.4 M  /hbase/data/default/C_CONS/045859bcc70e36eb4d33f8ca3b7d9633
82.6 M   /hbase/data/default/C_CONS/05c868b6036cc4f1836f70be6215c851
74.1 M   /hbase/data/default/C_CONS/0816378c837f1f3b84f4d4060d22beb3
84.7 M   /hbase/data/default/C_CONS/083da8f5eb8a5b1cca76376449f357ca
346.6 M  /hbase/data/default/C_CONS/0ac70fcb1baea0896ea069a6bcc30898
333.8 M  /hbase/data/default/C_CONS/0b3be845bd4f5e958e8c9a18c8eaab21
72.7 M   /hbase/data/default/C_CONS/12c13610c50dbc8ab27f20b0ebf2bfc4
76.1 M   /hbase/data/default/C_CONS/1341966315d7e53be719d948d595bee0
72.4 M   /hbase/data/default/C_CONS/1acdbc05c502b11da4852a1f21228f44
70.0 M   /hbase/data/default/C_CONS/1b8f57d65f6c0e4de721e4c8f1944829
183.9 M  /hbase/data/default/C_CONS/1f1ae7ca9f725fcf9639a4d52086fa50
65.5 M   /hbase/data/default/C_CONS/20c10b96e2b9c40684aaeb6d0cfbf7c0
76.0 M   /hbase/data/default/C_CONS/22515194fe09adcd4cbb2f5307303c73
78.4 M   /hbase/data/default/C_CONS/236cd80393cb5b7c526bd2c45ce53a0a
150.0 M  /hbase/data/default/C_CONS/23bd80852f47b97b4122709ec844d4ed
81.6 M   /hbase/data/default/C_CONS/241b8bc415029dedf94c4a84e6c4ad3b
77.9 M   /hbase/data/default/C_CONS/27f1e59bde75ef3096a5bdd3eb402cd7
160.8 M  /hbase/data/default/C_CONS/30c2ae3be38b8cdf3b337054a7d61478
372.2 M  /hbase/data/default/C_CONS/31d606da71b35844d0cdc8a195c97d2e
182.6 M  /hbase/data/default/C_CONS/3274a022bc7419d426cf63caa1cc88e1
92.1 M   /hbase/data/default/C_CONS/344faae7971d87b51edf23f75a7c3746
154.7 M  /hbase/data/default/C_CONS/3b3f0c839bdb32ed2104f67c8a02da41
77.4 M   /hbase/data/default/C_CONS/3cf6b2bd0cfe85f3111d0ba1b84a60b4
71.5 M   /hbase/data/default/C_CONS/3f466db078d07e2ddddbfb11c681e0e3
77.8 M   /hbase/data/default/C_CONS/3f8c1b7dec05118eb9894bb591e32b2f
83.6 M   /hbase/data/default/C_CONS/45e105856fcb54748c48bd45e973a3b9
185.2 M  /hbase/data/default/C_CONS/4becd90d46a2d4a6bd8ecbe02b60892c
165.6 M  /hbase/data/default/C_CONS/4dcebd58c7013062c4a8583012a11b5a
67.3 M   /hbase/data/default/C_CONS/51f845d842605dda66b1ae01ad8a17e8
148.2 M  /hbase/data/default/C_CONS/532189155ab78dbd1e36aac3ab4878a8
172.6 M  /hbase/data/default/C_CONS/5401d9cb19adb9bd78718ea047e6d9d7
139.4 M  /hbase/data/default/C_CONS/547d2a8c54aae73e8f12b4570efd984c
89.5 M   /hbase/data/default/C_CONS/54cbac1f71c7781697052bb2aa1c5a18
101.3 M  /hbase/data/default/C_CONS/55263ce293327683b9c6e6098ec3e89a
85.2 M   /hbase/data/default/C_CONS/55f8c278e35de6bca5083c7a66e355fb
85.8 M   /hbase/data/default/C_CONS/57112558912e1de016327e115bc84f11
171.8 M  /hbase/data/default/C_CONS/572b886cbfe92ddcb97502f041953fb8
51       /hbase/data/default/C_CONS/6bd64d8cf6b38806731f7693bdd673c9
86.6 M   /hbase/data/default/C_CONS/7695703b7b527afc5f3524eee9b5d806
74.8 M   /hbase/data/default/C_CONS/7bb7567685f5e16a4379d7cf79de2ecc
120.1 M  /hbase/data/default/C_CONS/7c144bef991bb3c959d7ef6e2fa5036a
166.0 M  /hbase/data/default/C_CONS/7c7817eb3e531d5bda88b5f0de6a20de
173.5 M  /hbase/data/default/C_CONS/7d07c139575d007ecbb23fa946e39130
139.2 M  /hbase/data/default/C_CONS/8295aa701110ddf4055e8c3ca5bd9cad
91.7 M   /hbase/data/default/C_CONS/84b340d22471580ed8100d6614668eb1
81.2 M   /hbase/data/default/C_CONS/8605f4470498a01a5ec4c88e7ea8a458
78.3 M   /hbase/data/default/C_CONS/897da8e33275b80926ef38200132f819
234.4 M  /hbase/data/default/C_CONS/93f5ce30ed8e54cc282cb5b88fa28d76
126.3 M  /hbase/data/default/C_CONS/96dd1decd62e35c394bb8e7f6095f054
80.9 M   /hbase/data/default/C_CONS/998364405e57a7eedae094bca76a419e
184.8 M  /hbase/data/default/C_CONS/9df3b62b1bff59b67b75ad86d694b8c8
126.6 M  /hbase/data/default/C_CONS/a4531e06f3440349e7e6776b8bfedaf0
79.3 M   /hbase/data/default/C_CONS/aa0b8341d3ca925ed24309f46e0ab845
79.9 M   /hbase/data/default/C_CONS/aa45bfa549a439ded2a8b159a5c9caaa
84.9 M   /hbase/data/default/C_CONS/abae60b33de2999698a7452ff62dad08
87.0 M   /hbase/data/default/C_CONS/ac5ff05785bc6e07637106450c74d02a
80.7 M   /hbase/data/default/C_CONS/aca765b578b236978b11ec26c167a958
68.0 M   /hbase/data/default/C_CONS/b03614566cc8d521a9c983d418b57866
77.4 M   /hbase/data/default/C_CONS/b1ae0451f592b28eed8a58908f91293a
91.5 M   /hbase/data/default/C_CONS/b8396049e2b742108add1485c0eb4aeb
81.2 M   /hbase/data/default/C_CONS/b8d25b3e536b4fea5ee4ee2b21885c76
87.8 M   /hbase/data/default/C_CONS/bbfbe319705df23a23a89b40e52d89a8
81.3 M   /hbase/data/default/C_CONS/bccaeedc65d9295289f78aaec588cc3d
95.8 M   /hbase/data/default/C_CONS/c229d583958802571dfaa9a39453df0d
88.5 M   /hbase/data/default/C_CONS/c9d7a038243d1b3e2448a48007f1f9e0
158.8 M  /hbase/data/default/C_CONS/cca1bf1f013724af25d71ad4310e5d4a
212.8 M  /hbase/data/default/C_CONS/ccabf798734aa8e05798c43c132ad565
85.1 M   /hbase/data/default/C_CONS/d1cb54346e109b1ba76fd95aa4540161
84.4 M   /hbase/data/default/C_CONS/d4dd8c3fa81b751892689cc92a96aa99
139.5 M  /hbase/data/default/C_CONS/dc15ceeed21474b51086f3103cbd0074
97.7 M   /hbase/data/default/C_CONS/df20e2077f22e83ecd8e55550d52dea1
221.0 M  /hbase/data/default/C_CONS/e30d0d55e0887a676c8b79e03771ad23
75.7 M   /hbase/data/default/C_CONS/e6ed24ce0b3e1e903bd9757d28380f3a
74.9 M   /hbase/data/default/C_CONS/e9732d9905f5373fb0fd7a1ce033e17b
101.2 M  /hbase/data/default/C_CONS/f2a49dbaf018f0e45bbd7a758f123418
172.6 M  /hbase/data/default/C_CONS/f34645de36d3c1413ce83177e2118947
89.2 M   /hbase/data/default/C_CONS/f3db2bf3b7ffb7b4c0029eac5d631bdb
81.6 M   /hbase/data/default/C_CONS/f43b49c4f384853266e9ee45a98104a6
68.9 M   /hbase/data/default/C_CONS/fa4fb0047ec98fb10bf84fd72937f415
86.7 M   /hbase/data/default/C_CONS/fc69f349655676e046c9110550825f5a
155.0 M  /hbase/data/default/C_CONS/feb0835bdf73c257de11c65f18b1330d
75.2 M   /hbase/data/default/C_CONS/fff9fbe56af8b9e0e00826f8936e7a56



From the result above we can see that the biggest region's size is 346.6 M,
while most other regions' sizes are close to one another.

So what may be the real reason?

2014-09-30 12:17 GMT+08:00 Vladimir Rodionov <vr...@splicemachine.com>:

> HBase TableInputFormat creates one input split per region. You cannot
> achieve a high level of parallelism unless you have at least 5-10 regions
> per RS. What does that mean? You probably have too few regions. You can
> verify that in the HBase Web UI.
>
> -Vladimir Rodionov
>
> On Mon, Sep 29, 2014 at 7:21 PM, Tao Xiao <xi...@gmail.com>
> wrote:
>
>> I submitted a job in Yarn-Client mode, which simply reads from an HBase
>> table containing tens of millions of records and then does a count action.
>> The job runs for a much longer time than I expected, so I wonder whether it
>> was because there was too much data to read. Actually, there are 20 nodes in
>> my Hadoop cluster, so the HBase table does not seem that big (tens of
>> millions of records).
>>
>> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>>
>> BTW, when the job was running, I could see logs on the console, and
>> specifically I'd like to know what the following log means:
>>
>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
>> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20
>> as 13454 bytes in 0 ms
>> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
>> ms on b04.jsepc.com (progress: 18/86)
>> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
>>
>>
>> Thanks
>>
>
>

Re: Reading from HBase is too slow

Posted by Vladimir Rodionov <vr...@splicemachine.com>.
HBase TableInputFormat creates one input split per region. You cannot
achieve a high level of parallelism unless you have at least 5-10 regions
per RS. What does that mean? You probably have too few regions. You can
verify that in the HBase Web UI.
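
A quick way to confirm this from the Spark side (a sketch, where rdd is the
RDD returned by newAPIHadoopRDD):

  // One partition per input split, i.e. one per region with TableInputFormat.
  println(rdd.partitions.size)

If that number is small, creating the table with more (pre-split) regions is
what actually raises the read parallelism.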

-Vladimir Rodionov

On Mon, Sep 29, 2014 at 7:21 PM, Tao Xiao <xi...@gmail.com> wrote:

> I submitted a job in Yarn-Client mode, which simply reads from an HBase
> table containing tens of millions of records and then does a count action.
> The job runs for a much longer time than I expected, so I wonder whether it
> was because there was too much data to read. Actually, there are 20 nodes in
> my Hadoop cluster, so the HBase table does not seem that big (tens of
> millions of records).
>
> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>
> BTW, when the job was running, I could see logs on the console, and
> specifically I'd like to know what the following log means:
>
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
> 13454 bytes in 0 ms
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
> ms on b04.jsepc.com (progress: 18/86)
> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
>
>
> Thanks
>

Re: Reading from HBase is too slow

Posted by Ted Yu <yu...@gmail.com>.
Are the regions for this table evenly spread across nodes in your cluster?

Were region servers under (heavy) load when your job ran?

Cheers

On Mon, Sep 29, 2014 at 7:21 PM, Tao Xiao <xi...@gmail.com> wrote:

> I submitted a job in Yarn-Client mode, which simply reads from an HBase
> table containing tens of millions of records and then does a count action.
> The job runs for a much longer time than I expected, so I wonder whether it
> was because there was too much data to read. Actually, there are 20 nodes in
> my Hadoop cluster, so the HBase table does not seem that big (tens of
> millions of records).
>
> I'm using CDH 5.0.0 (Spark 0.9 and HBase 0.96).
>
> BTW, when the job was running, I could see logs on the console, and
> specifically I'd like to know what the following log means:
>
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Starting task 0.0:20 as
> TID 20 on executor 2: b04.jsepc.com (PROCESS_LOCAL)
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Serialized task 0.0:20 as
> 13454 bytes in 0 ms
> 14/09/30 09:45:20 INFO scheduler.TaskSetManager: Finished TID 19 in 16426
> ms on b04.jsepc.com (progress: 18/86)
> 14/09/30 09:45:20 INFO scheduler.DAGScheduler: Completed ResultTask(0, 19)
>
>
> Thanks
>