Posted to user@spark.apache.org by bonnahu <bo...@gmail.com> on 2014/12/04 23:56:26 UTC

Loading a large HBase table into a Spark RDD takes quite a long time

I am trying to load a large HBase table into a Spark RDD to run a Spark SQL
query on the entity. For an entity with about 6 million rows, it takes
about 35 seconds to load it into the RDD. Is that expected? Is there any way
to shorten the loading process? I have picked up some tips from
http://hbase.apache.org/book/perf.reading.html to speed up the process,
e.g., scan.setCaching(cacheSize) and adding only the necessary
attributes/columns to the scan. I am just wondering whether there are other
ways to improve the speed?

Here is the code snippet:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf()
        .setMaster("spark://url").setAppName("SparkSQLTest");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);

// Point the HBase client at the cluster and the target table.
Configuration hbase_conf = HBaseConfiguration.create();
hbase_conf.set("hbase.zookeeper.quorum", "url");
hbase_conf.set("hbase.regionserver.port", "60020");
hbase_conf.set("hbase.master", "url");
hbase_conf.set(TableInputFormat.INPUT_TABLE, entityName);

// Scan only the three needed columns; setCaching batches rows per RPC.
Scan scan = new Scan();
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col1"));
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col2"));
scan.addColumn(Bytes.toBytes("MetaInfo"), Bytes.toBytes("col3"));
scan.setCaching(this.cacheSize);
hbase_conf.set(TableInputFormat.SCAN, convertScanToString(scan));

// newAPIHadoopRDD creates one Spark partition per HBase region.
JavaPairRDD<ImmutableBytesWritable, Result> hBaseRDD =
        jsc.newAPIHadoopRDD(hbase_conf, TableInputFormat.class,
                ImmutableBytesWritable.class, Result.class);
logger.info("count is " + hBaseRDD.cache().count());
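
The same guide also suggests turning off block caching for full scans; a
sketch of that knob applied in the same place, where the caching value of
500 is illustrative rather than a recommendation:

// Sketch: extra scan-side tuning (values are illustrative).
scan.setCaching(500);        // rows fetched per RPC round trip
scan.setCacheBlocks(false);  // keep a full scan from churning the block cache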




Re: Loading a large HBase table into a Spark RDD takes quite a long time

Posted by bonnahu <bo...@gmail.com>.
Hi Ted,
Here is the information about the regions:

Region Server                  Region Count
http://regionserver1:60030/    44
http://regionserver2:60030/    39
http://regionserver3:60030/    55
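
Since TableInputFormat creates one input split per region, this should
translate into about 44 + 39 + 55 = 138 Spark partitions. A quick way to
confirm, using the hBaseRDD from the snippet above:

// Each region becomes one partition, so ~138 partitions are expected here.
logger.info("partitions: " + hBaseRDD.partitions().size());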





Re: Loading a large Hbase table into SPARK RDD takes quite long time

Posted by Ted Yu <yu...@gmail.com>.
bonnahu:
How many regions does your table have?
Are they evenly distributed?
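
One way to check both from client code; a sketch against the HBase
0.94/0.98-era API (HTable#getRegionLocations is deprecated in later
releases), reusing hbase_conf and entityName from the snippet earlier in
the thread:

import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.hbase.ServerName;
import org.apache.hadoop.hbase.client.HTable;

// Sketch: count how many regions of the table each server hosts.
HTable table = new HTable(hbase_conf, entityName);
Map<String, Integer> regionsPerServer = new HashMap<String, Integer>();
for (ServerName sn : table.getRegionLocations().values()) {
    Integer n = regionsPerServer.get(sn.getHostname());
    regionsPerServer.put(sn.getHostname(), n == null ? 1 : n + 1);
}
table.close();
System.out.println("regions per server: " + regionsPerServer);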

Cheers


Re: Loading a large HBase table into a Spark RDD takes quite a long time

Posted by bonnahu <bo...@gmail.com>.
Hi,
Here is the configuration of the cluster:

Workers: 2
For each worker:
  Cores: 24 total, 0 used
  Memory: 69.6 GB total, 0.0 B used

I didn't set spark.executor.memory, so it should be the default value of
512 MB.
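
If that default is the limiting factor, raising it is a one-liner; a
sketch, where "8g" is illustrative and should be sized against the
~69.6 GB available per worker:

// Sketch: raise the executor heap above the 512 MB default ("8g" is
// illustrative). The same setting is also available on the command line
// as: spark-submit --executor-memory 8g ...
sparkConf.set("spark.executor.memory", "8g");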

> How much space does one row only consisting of the 3 columns consume?
The 3 columns are very small, probably less than 100 bytes per row.
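
Back-of-envelope: 6,000,000 rows x ~100 bytes is roughly 600 MB of raw
data, already more than the 512 MB default executor memory, and a cached,
deserialized RDD typically occupies several times its raw size in the JVM.
So hBaseRDD.cache() would likely spill or fall back to recomputation
rather than keep the table in memory.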





Re: Loading a large HBase table into a Spark RDD takes quite a long time

Posted by Jörn Franke <jo...@gmail.com>.
Hi,

What is your cluster setup? How much memory do you have? How much space
does one row, consisting of only the 3 columns, consume? Do you run other
stuff in the background?

Best regards