Posted to user@spark.apache.org by David Quigley <dq...@gmail.com> on 2014/04/09 18:02:14 UTC

hbase scan performance

Hi all,

We currently use HBase to store user data and periodically run a full scan
to aggregate it. We use HBase because we need a single user's data to be
contiguous, so as user data comes in, we need the ability to update a
random-access store.

A full HBase scan with MapReduce is frustratingly slow, even after
implementing the recommended optimizations. I see that it is possible to
scan HBase with Spark, but I am not familiar with how Spark interfaces with
HBase. Would you expect the scan to perform about the same as a Spark input
as it does as a MapReduce input?

Thanks,
Dave

Re: hbase scan performance

Posted by Patrick Wendell <pw...@gmail.com>.

This job might still be faster in Spark: MapReduce adds other overheads on
top of the fact that sequential reads from HBase are slow. But it's possible
that the bottleneck is the HBase scan performance itself.

- Patrick


On Wed, Apr 9, 2014 at 10:10 AM, Jerry Lam <ch...@gmail.com> wrote:

> Hi Dave,
>
> This is the HBase solution to the poor scan performance issue:
> https://issues.apache.org/jira/browse/HBASE-8369
>
> I encountered the same issue before.
> To the best of my knowledge, this is not a MapReduce issue. It is an HBase
> issue. If you are planning to swap out MapReduce and replace it with Spark,
> I don't think you will gain much performance when scanning HBase, unless
> you are talking about caching the results from HBase in Spark and reusing
> them over and over.
>
> HTH,
>
> Jerry

Re: hbase scan performance

Posted by Jerry Lam <ch...@gmail.com>.

Hi Dave,

This is the HBase solution to the poor scan performance issue:
https://issues.apache.org/jira/browse/HBASE-8369

I encountered the same issue before.
To the best of my knowledge, this is not a MapReduce issue. It is an HBase
issue. If you are planning to swap out MapReduce and replace it with Spark,
I don't think you will gain much performance when scanning HBase, unless
you are talking about caching the results from HBase in Spark and reusing
them over and over.
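
For illustration, here is a minimal sketch of what that caching approach
could look like. Spark can read HBase through the same TableInputFormat
that MapReduce uses (via SparkContext.newAPIHadoopRDD), so the scan itself
goes through the same code path; any gain would come from scanning once and
reusing the cached RDD. The table name "user_data", the scanner caching
value of 500, the object name, and the per-row cell count used as the
"aggregate" are all assumptions made up for this example, not details from
Dave's setup.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.{SparkConf, SparkContext}

object HBaseScanSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("hbase-scan-sketch"))

    // Picks up hbase-site.xml from the classpath.
    val hbaseConf = HBaseConfiguration.create()
    // Hypothetical table name; replace with the real one.
    hbaseConf.set(TableInputFormat.INPUT_TABLE, "user_data")
    // Assumed tuning values: fetch more rows per RPC, and skip the
    // block cache because a full scan would only churn it.
    hbaseConf.set(TableInputFormat.SCAN_CACHEDROWS, "500")
    hbaseConf.set(TableInputFormat.SCAN_CACHEBLOCKS, "false")

    // One Spark partition per HBase region, as with a MapReduce job.
    val rows = sc.newAPIHadoopRDD(
      hbaseConf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Copy plain values out of the HBase objects before caching, since
    // Hadoop record readers may reuse the objects they hand back.
    val cellsPerRow = rows
      .map { case (_, result) => (Bytes.toString(result.getRow), result.size) }
      .cache()

    // The first action triggers the (slow) HBase scan and fills the cache.
    val rowCount = cellsPerRow.count()
    // Later actions reuse the cached data instead of rescanning HBase.
    val cellCount = cellsPerRow.map(_._2.toLong).fold(0L)(_ + _)

    println(s"rows=$rowCount cells=$cellCount")
    sc.stop()
  }
}
```

If the job only makes a single pass over the data, the cache() call buys
nothing and the run time will most likely still be dominated by the scan.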

HTH,

Jerry