Posted to user@spark.apache.org by Stuart Layton <st...@gmail.com> on 2015/03/26 14:46:06 UTC

RDD equivalent of HBase Scan

HBase scans let you specify filters that make scans very fast and
efficient, since they allow the scan to seek directly to the keys that
pass the filter.

Do RDDs or Spark DataFrames offer anything similar, or would I need to
use a NoSQL database like HBase to do something like this?

-- 
Stuart Layton

Re: RDD equivalent of HBase Scan

Posted by Sean Owen <so...@cloudera.com>.
An RDD is a very different creature from a NoSQL store, so I would not
think of them as in the same ballpark for NoSQL-like workloads. An RDD is
not built for point queries or range scans, since any such request would
launch a distributed job to scan all partitions. Nor is it built for,
say, thousands of concurrent jobs (queries).
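To make the contrast concrete, here is a minimal pure-Python sketch (not
Spark or HBase API; all names are illustrative) comparing a full scan over
every row, which is roughly what filtering an RDD amounts to, with a seek
over sorted keys, which is what an HBase Scan with a start row can do:

```python
# Illustrative only: contrasts a full scan (touch every row, as an RDD
# filter does across all partitions) with a seek over sorted keys (as an
# HBase Scan with a start row can do).
from bisect import bisect_left

rows = [(f"key{i:05d}", i) for i in range(10_000)]  # sorted by key

def full_scan(rows, start, stop):
    # O(n): examines every row, like filtering an unindexed RDD.
    return [r for r in rows if start <= r[0] < stop]

def seek_scan(rows, start, stop):
    # O(log n) to find the start key, then reads only the matching range.
    keys = [k for k, _ in rows]
    lo = bisect_left(keys, start)
    hi = bisect_left(keys, stop)
    return rows[lo:hi]

assert full_scan(rows, "key00100", "key00110") == seek_scan(rows, "key00100", "key00110")
```

Both return the same rows, but the seek version never touches rows outside
the requested key range, which is the efficiency the original question is
after.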

On Thu, Mar 26, 2015 at 1:57 PM, Stuart Layton <st...@gmail.com> wrote:
> Thanks, but I'm hoping to get away from HBase altogether. I was wondering
> whether there is a way to get similar scan performance directly on cached
> RDDs or DataFrames.
>
> On Thu, Mar 26, 2015 at 9:54 AM, Ted Yu <yu...@gmail.com> wrote:
>>
>> In examples//src/main/scala/org/apache/spark/examples/HBaseTest.scala,
>> TableInputFormat is used.
>> TableInputFormat accepts parameter
>>
>>   public static final String SCAN = "hbase.mapreduce.scan";
>>
>> where if specified, Scan object would be created from String form:
>>
>>     if (conf.get(SCAN) != null) {
>>
>>       try {
>>
>>         scan = TableMapReduceUtil.convertStringToScan(conf.get(SCAN));
>>
>> You can use TableMapReduceUtil#convertScanToString() to convert a Scan
>> which has filter(s) and pass to TableInputFormat
>>
>> Cheers
>>
>>
>> On Thu, Mar 26, 2015 at 6:46 AM, Stuart Layton <st...@gmail.com>
>> wrote:
>>>
>>> HBase scans come with the ability to specify filters that make scans very
>>> fast and efficient (as they let you seek for the keys that pass the filter).
>>>
>>> Do RDD's or Spark DataFrames offer anything similar or would I be
>>> required to use a NoSQL db like HBase to do something like this?
>>>
>>> --
>>> Stuart Layton
>>
>>
>
>
>
> --
> Stuart Layton

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: RDD equivalent of HBase Scan

Posted by Stuart Layton <st...@gmail.com>.
Thanks, but I'm hoping to get away from HBase altogether. I was wondering
whether there is a way to get similar scan performance directly on cached
RDDs or DataFrames.
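For what it's worth, one way to approximate scan-like behaviour on cached
data is to range-partition it by key and track the min/max key of each
partition, so a range query can skip partitions that cannot contain
matches. A minimal sketch in plain Python (not the Spark API; all names
are made up for illustration):

```python
# Illustrative sketch: range-partitioned data with per-partition key
# bounds, allowing whole partitions to be pruned for a range query
# (conceptually similar to how HBase prunes regions by key range).
partitions = [
    [("a1", 1), ("b2", 2)],
    [("c3", 3), ("d4", 4)],
    [("e5", 5), ("f6", 6)],
]
# (min_key, max_key) per partition; valid because each partition is sorted.
bounds = [(p[0][0], p[-1][0]) for p in partitions]

def range_query(start, stop):
    out = []
    for (lo, hi), part in zip(bounds, partitions):
        if hi < start or lo >= stop:
            continue  # prune: this partition cannot contain matching keys
        out.extend(kv for kv in part if start <= kv[0] < stop)
    return out
```

The pruning step is what avoids the "scan all partitions" cost Sean
describes, at the price of keeping the data sorted and the bounds known.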

On Thu, Mar 26, 2015 at 9:54 AM, Ted Yu <yu...@gmail.com> wrote:

> In examples//src/main/scala/org/apache/spark/examples/HBaseTest.scala,
> TableInputFormat is used.
> TableInputFormat accepts parameter
>
>   public static final String SCAN = "hbase.mapreduce.scan";
>
> where if specified, Scan object would be created from String form:
>
>     if (conf.get(SCAN) != null) {
>
>       try {
>
>         scan = TableMapReduceUtil.convertStringToScan(conf.get(SCAN));
>
> You can use TableMapReduceUtil#convertScanToString() to convert a Scan
> which has filter(s) and pass to TableInputFormat
>
> Cheers
>
> On Thu, Mar 26, 2015 at 6:46 AM, Stuart Layton <st...@gmail.com>
> wrote:
>
>> HBase scans come with the ability to specify filters that make scans very
>> fast and efficient (as they let you seek for the keys that pass the filter).
>>
>> Do RDD's or Spark DataFrames offer anything similar or would I be
>> required to use a NoSQL db like HBase to do something like this?
>>
>> --
>> Stuart Layton
>>
>
>


-- 
Stuart Layton

Re: RDD equivalent of HBase Scan

Posted by Ted Yu <yu...@gmail.com>.
In examples/src/main/scala/org/apache/spark/examples/HBaseTest.scala,
TableInputFormat is used. TableInputFormat accepts the parameter

  public static final String SCAN = "hbase.mapreduce.scan";

If this parameter is set, the Scan object is reconstructed from its
String form:

    if (conf.get(SCAN) != null) {
      try {
        scan = TableMapReduceUtil.convertStringToScan(conf.get(SCAN));

You can use TableMapReduceUtil#convertScanToString() to convert a Scan
which has filter(s), and pass the result to TableInputFormat under that
parameter.

Cheers
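The pattern above, sketched in plain Python for illustration: the Scan
(with its filters) is serialized to a string, stored in the job
configuration under a known key, and rebuilt on the other side. The real
helpers are TableMapReduceUtil.convertScanToString /
convertStringToScan; the json+base64 encoding and dict-based "conf" below
are stand-ins, not HBase API.

```python
# Illustrative round-trip of a "Scan" through a string-valued config,
# mimicking the convertScanToString / convertStringToScan pattern.
import base64
import json

SCAN_KEY = "hbase.mapreduce.scan"  # the key TableInputFormat reads

def convert_scan_to_string(scan):
    # Stand-in for TableMapReduceUtil.convertScanToString().
    return base64.b64encode(json.dumps(scan).encode()).decode()

def convert_string_to_scan(s):
    # Stand-in for TableMapReduceUtil.convertStringToScan().
    return json.loads(base64.b64decode(s))

conf = {}
scan = {"start_row": "key00100", "stop_row": "key00110",
        "filters": ["PrefixFilter('key001')"]}
conf[SCAN_KEY] = convert_scan_to_string(scan)

# ...later, TableInputFormat-style, on the reading side:
if conf.get(SCAN_KEY) is not None:
    rebuilt = convert_string_to_scan(conf[SCAN_KEY])
assert rebuilt == scan
```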

On Thu, Mar 26, 2015 at 6:46 AM, Stuart Layton <st...@gmail.com>
wrote:

> HBase scans come with the ability to specify filters that make scans very
> fast and efficient (as they let you seek for the keys that pass the filter).
>
> Do RDD's or Spark DataFrames offer anything similar or would I be required
> to use a NoSQL db like HBase to do something like this?
>
> --
> Stuart Layton
>