Posted to user@spark.apache.org by Shushant Arora <sh...@gmail.com> on 2015/07/15 20:46:16 UTC

spark streaming job to hbase write

Hi

I have a requirement to write to an HBase table from a Spark Streaming app
after some processing.
Is the HBase put operation the only way of writing to HBase, or is there a
specialised connector or Spark RDD for HBase writes?

Should bulk load to HBase from a streaming app be avoided if the output of
each batch interval is just a few MBs?

Thanks

Re: spark streaming job to hbase write

Posted by Michael Segel <ms...@hotmail.com>.
You ask an interesting question… 

Let's set aside Spark and look at the overall ingestion pattern.

It's really an ingestion pattern where your input into the system comes from a queue.

Are the events discrete or continuous? (This is kinda important.) 

If the events are continuous, then more than likely you're going to be ingesting data where the key is somewhat sequential. If you use put(), you end up with hot spotting, and you'll end up with regions that are only half full.
So you would be better off batching up the data and doing bulk imports.

If the events are discrete, then you'll want to use put(), because the odds are you will not be using a sequential key. (You could, but I'd suggest that you rethink your primary key.)

Depending on the rate of ingestion, you may want to do a manual flush. (It depends on the velocity of the data to be ingested and your use case.)
(Remember what caching occurs, and where, when dealing with HBase.)
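
A rough sketch of that put()-plus-manual-flush path, assuming the HBase 1.x client API and a hypothetical "events" table with a "d" column family (the class name and schema here are illustrative, not from the thread); something like this could be called from foreachPartition in the streaming job:

    import java.util.Iterator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.BufferedMutator;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class HBaseSink {
      // Writes one partition's (rowKey, value) pairs, buffering the puts on the
      // client and flushing once at the end instead of one RPC per Put.
      public static void writePartition(Iterator<String[]> rows) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator =
                 conn.getBufferedMutator(TableName.valueOf("events"))) {
          while (rows.hasNext()) {
            String[] kv = rows.next();                    // [rowKey, value]
            Put put = new Put(Bytes.toBytes(kv[0]));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(kv[1]));
            mutator.mutate(put);                          // buffered, not yet sent
          }
          mutator.flush();                                // one manual flush per partition
        }
      }
    }

Whether opening a connection per partition per batch is acceptable depends on your batch interval and volume; a shared connection per executor is a common refinement.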

A third option… Depending on how you use the data, you may want to avoid storing the data in HBase, and only use HBase as an index into wherever you store the data files for quick access. Again, it depends on your data ingestion flow and how you intend to use the data.

So really this is less a Spark issue than an HBase issue when it comes to design.

HTH

-Mike
> On Jul 15, 2015, at 11:46 AM, Shushant Arora <sh...@gmail.com> wrote:
> 
> Hi
> 
> I have a requirement of writing in hbase table from Spark streaming app after some processing.
> Is Hbase put operation the only way of writing to hbase or is there any specialised connector or rdd of spark for hbase write.
> 
> Should Bulk load to hbase from streaming  app be avoided if output of each batch interval is just few mbs?
> 
> Thanks
> 



---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org


Re: spark streaming job to hbase write

Posted by Todd Nist <ts...@gmail.com>.
There are three HBase connector packages listed on the Spark Packages web site:

http://spark-packages.org/?q=hbase

HTH.

-Todd

On Wed, Jul 15, 2015 at 2:46 PM, Shushant Arora <sh...@gmail.com>
wrote:

> Hi
>
> I have a requirement of writing in hbase table from Spark streaming app
> after some processing.
> Is Hbase put operation the only way of writing to hbase or is there any
> specialised connector or rdd of spark for hbase write.
>
> Should Bulk load to hbase from streaming  app be avoided if output of each
> batch interval is just few mbs?
>
> Thanks
>
>

Re: spark streaming job to hbase write

Posted by Ted Yu <yu...@gmail.com>.
The HBase client resorts to the following method for finding the region location:

  private RegionLocations locateRegionInMeta(TableName tableName, byte[] row,
      boolean useCache, boolean retry, int replicaId) throws IOException {

Note: useCache value is true in this call path.

Meaning the client-side cache would be consulted to reduce RPCs to the server
hosting hbase:meta.
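
To see what that lookup resolves to from the application side, here is a small sketch (assuming the HBase 1.x client API; the "events" table and row key are made up) that asks the client where a given row would land:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionLocation;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.RegionLocator;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class WhereDoesThisRowGo {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("events"))) {
          // Resolves the hosting region and server for this key; the answer comes
          // from the client-side meta cache when possible, otherwise via an RPC
          // to the server hosting hbase:meta.
          HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("somerowkey"));
          System.out.println(loc.getServerName() + " hosts "
              + loc.getRegionInfo().getRegionNameAsString());
        }
      }
    }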

Cheers

On Fri, Jul 17, 2015 at 7:41 AM, Shushant Arora <sh...@gmail.com>
wrote:

> Is this map creation happening on client side ?
>
> But how does it know which RS will contain that row key in put operation
> until asking the .Meta. table .
>  Does Hbase client first gets that ranges of keys of each Reagionservers
> and then group put objects based on Region servers ?
>
> On Fri, Jul 17, 2015 at 7:48 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> Internally AsyncProcess uses a Map which is keyed by server name:
>>
>>     Map<ServerName, MultiAction<Row>> actionsByServer =
>>
>>         new HashMap<ServerName, MultiAction<Row>>();
>>
>> Here MultiAction would group Put's in your example which are destined for
>> the same server.
>>
>> Cheers
>>
>> On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora <
>> shushantarora09@gmail.com> wrote:
>>
>>> Thanks !
>>>
>>> My key is random (hexadecimal). So hot spot should not be created.
>>>
>>> Is there any concept of bulk put. Say I want to raise a one put request
>>> for a 1000 size batch which will hit a region server instead of individual
>>> put for each key.
>>>
>>>
>>> Htable.put(List<Put>) Does this handles batching of put based on
>>> regionserver to which they will land to finally. Say in my batch there are
>>> 10 puts- 5 for RS1,3 for RS3 and 2 for RS3. Does this handles that?
>>>
>>> On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <
>>> michael_segel@hotmail.com> wrote:
>>>
>>>> You ask an interesting question…
>>>>
>>>> Lets set aside spark, and look at the overall ingestion pattern.
>>>>
>>>> Its really an ingestion pattern where your input in to the system is
>>>> from a queue.
>>>>
>>>> Are the events discrete or continuous? (This is kinda important.)
>>>>
>>>> If the events are continuous then more than likely you’re going to be
>>>> ingesting data where the key is somewhat sequential. If you use put(), you
>>>> end up with hot spotting. And you’ll end up with regions half full.
>>>> So you would be better off batching up the data and doing bulk imports.
>>>>
>>>> If the events are discrete, then you’ll want to use put() because the
>>>> odds are you will not be using a sequential key. (You could, but I’d
>>>> suggest that you rethink your primary key)
>>>>
>>>> Depending on the rate of ingestion, you may want to do a manual flush.
>>>> (It depends on the velocity of data to be ingested and your use case )
>>>> (Remember what caching occurs and where when dealing with HBase.)
>>>>
>>>> A third option… Depending on how you use the data, you may want to
>>>> avoid storing the data in HBase, and only use HBase as an index to where
>>>> you store the data files for quick access.  Again it depends on your data
>>>> ingestion flow and how you intend to use the data.
>>>>
>>>> So really this is less a spark issue than an HBase issue when it comes
>>>> to design.
>>>>
>>>> HTH
>>>>
>>>> -Mike
>>>>
>>>> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <
>>>> shushantarora09@gmail.com> wrote:
>>>> >
>>>> > Hi
>>>> >
>>>> > I have a requirement of writing in hbase table from Spark streaming
>>>> app after some processing.
>>>> > Is Hbase put operation the only way of writing to hbase or is there
>>>> any specialised connector or rdd of spark for hbase write.
>>>> >
>>>> > Should Bulk load to hbase from streaming  app be avoided if output of
>>>> each batch interval is just few mbs?
>>>> >
>>>> > Thanks
>>>> >
>>>>
>>>> The opinions expressed here are mine, while they may reflect a
>>>> cognitive thought, that is purely accidental.
>>>> Use at your own risk.
>>>> Michael Segel
>>>> michael_segel (AT) hotmail.com
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>

Re: spark streaming job to hbase write

Posted by Shushant Arora <sh...@gmail.com>.
Is this map creation happening on the client side?

But how does it know which region server will contain a given row key for a put
operation without asking the hbase:meta table?
Does the HBase client first get the key ranges of each region server
and then group the Put objects by region server?

On Fri, Jul 17, 2015 at 7:48 PM, Ted Yu <yu...@gmail.com> wrote:

> Internally AsyncProcess uses a Map which is keyed by server name:
>
>     Map<ServerName, MultiAction<Row>> actionsByServer =
>
>         new HashMap<ServerName, MultiAction<Row>>();
>
> Here MultiAction would group Put's in your example which are destined for
> the same server.
>
> Cheers
>
> On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora <shushantarora09@gmail.com
> > wrote:
>
>> Thanks !
>>
>> My key is random (hexadecimal). So hot spot should not be created.
>>
>> Is there any concept of bulk put. Say I want to raise a one put request
>> for a 1000 size batch which will hit a region server instead of individual
>> put for each key.
>>
>>
>> Htable.put(List<Put>) Does this handles batching of put based on
>> regionserver to which they will land to finally. Say in my batch there are
>> 10 puts- 5 for RS1,3 for RS3 and 2 for RS3. Does this handles that?
>>
>> On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <michael_segel@hotmail.com
>> > wrote:
>>
>>> You ask an interesting question…
>>>
>>> Lets set aside spark, and look at the overall ingestion pattern.
>>>
>>> Its really an ingestion pattern where your input in to the system is
>>> from a queue.
>>>
>>> Are the events discrete or continuous? (This is kinda important.)
>>>
>>> If the events are continuous then more than likely you’re going to be
>>> ingesting data where the key is somewhat sequential. If you use put(), you
>>> end up with hot spotting. And you’ll end up with regions half full.
>>> So you would be better off batching up the data and doing bulk imports.
>>>
>>> If the events are discrete, then you’ll want to use put() because the
>>> odds are you will not be using a sequential key. (You could, but I’d
>>> suggest that you rethink your primary key)
>>>
>>> Depending on the rate of ingestion, you may want to do a manual flush.
>>> (It depends on the velocity of data to be ingested and your use case )
>>> (Remember what caching occurs and where when dealing with HBase.)
>>>
>>> A third option… Depending on how you use the data, you may want to avoid
>>> storing the data in HBase, and only use HBase as an index to where you
>>> store the data files for quick access.  Again it depends on your data
>>> ingestion flow and how you intend to use the data.
>>>
>>> So really this is less a spark issue than an HBase issue when it comes
>>> to design.
>>>
>>> HTH
>>>
>>> -Mike
>>>
>>> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <
>>> shushantarora09@gmail.com> wrote:
>>> >
>>> > Hi
>>> >
>>> > I have a requirement of writing in hbase table from Spark streaming
>>> app after some processing.
>>> > Is Hbase put operation the only way of writing to hbase or is there
>>> any specialised connector or rdd of spark for hbase write.
>>> >
>>> > Should Bulk load to hbase from streaming  app be avoided if output of
>>> each batch interval is just few mbs?
>>> >
>>> > Thanks
>>> >
>>>
>>> The opinions expressed here are mine, while they may reflect a cognitive
>>> thought, that is purely accidental.
>>> Use at your own risk.
>>> Michael Segel
>>> michael_segel (AT) hotmail.com
>>>
>>>
>>>
>>>
>>>
>>>
>>
>

Re: spark streaming job to hbase write

Posted by Ted Yu <yu...@gmail.com>.
Internally, AsyncProcess uses a Map keyed by server name:

    Map<ServerName, MultiAction<Row>> actionsByServer =
        new HashMap<ServerName, MultiAction<Row>>();

Here MultiAction groups the Puts in your example that are destined for
the same server.
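
So from the application side a single batched call is enough; a minimal sketch (assuming the HBase 1.x Table API and a hypothetical "events" table with a "d" family):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public final class BatchPutExample {
      public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("events"))) {
          List<Put> batch = new ArrayList<Put>();
          for (int i = 0; i < 1000; i++) {
            Put put = new Put(Bytes.toBytes(Integer.toHexString(i)));  // hex row keys
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"),
                Bytes.toBytes("value-" + i));
            batch.add(put);
          }
          // A single call from the application; the client groups the puts by
          // region server (the actionsByServer map above) and sends one
          // multi-request per server rather than 1000 individual RPCs.
          table.put(batch);
        }
      }
    }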

Cheers

On Fri, Jul 17, 2015 at 5:15 AM, Shushant Arora <sh...@gmail.com>
wrote:

> Thanks !
>
> My key is random (hexadecimal). So hot spot should not be created.
>
> Is there any concept of bulk put. Say I want to raise a one put request
> for a 1000 size batch which will hit a region server instead of individual
> put for each key.
>
>
> Htable.put(List<Put>) Does this handles batching of put based on
> regionserver to which they will land to finally. Say in my batch there are
> 10 puts- 5 for RS1,3 for RS3 and 2 for RS3. Does this handles that?
>
> On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <mi...@hotmail.com>
> wrote:
>
>> You ask an interesting question…
>>
>> Lets set aside spark, and look at the overall ingestion pattern.
>>
>> Its really an ingestion pattern where your input in to the system is from
>> a queue.
>>
>> Are the events discrete or continuous? (This is kinda important.)
>>
>> If the events are continuous then more than likely you’re going to be
>> ingesting data where the key is somewhat sequential. If you use put(), you
>> end up with hot spotting. And you’ll end up with regions half full.
>> So you would be better off batching up the data and doing bulk imports.
>>
>> If the events are discrete, then you’ll want to use put() because the
>> odds are you will not be using a sequential key. (You could, but I’d
>> suggest that you rethink your primary key)
>>
>> Depending on the rate of ingestion, you may want to do a manual flush.
>> (It depends on the velocity of data to be ingested and your use case )
>> (Remember what caching occurs and where when dealing with HBase.)
>>
>> A third option… Depending on how you use the data, you may want to avoid
>> storing the data in HBase, and only use HBase as an index to where you
>> store the data files for quick access.  Again it depends on your data
>> ingestion flow and how you intend to use the data.
>>
>> So really this is less a spark issue than an HBase issue when it comes to
>> design.
>>
>> HTH
>>
>> -Mike
>>
>> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <sh...@gmail.com>
>> wrote:
>> >
>> > Hi
>> >
>> > I have a requirement of writing in hbase table from Spark streaming app
>> after some processing.
>> > Is Hbase put operation the only way of writing to hbase or is there any
>> specialised connector or rdd of spark for hbase write.
>> >
>> > Should Bulk load to hbase from streaming  app be avoided if output of
>> each batch interval is just few mbs?
>> >
>> > Thanks
>> >
>>
>> The opinions expressed here are mine, while they may reflect a cognitive
>> thought, that is purely accidental.
>> Use at your own risk.
>> Michael Segel
>> michael_segel (AT) hotmail.com
>>
>>
>>
>>
>>
>>
>

Re: spark streaming job to hbase write

Posted by Shushant Arora <sh...@gmail.com>.
Thanks!

My key is random (hexadecimal), so a hot spot should not be created.

Is there any concept of a bulk put? Say I want to issue one put request for a
batch of 1000 rows that hits each region server once, instead of an individual
put for each key.


Does HTable.put(List<Put>) handle batching of the puts by the region server
they will finally land on? Say in my batch there are 10 puts: 5 for RS1,
3 for RS2 and 2 for RS3. Does it handle that?

On Thu, Jul 16, 2015 at 8:31 PM, Michael Segel <mi...@hotmail.com>
wrote:

> You ask an interesting question…
>
> Lets set aside spark, and look at the overall ingestion pattern.
>
> Its really an ingestion pattern where your input in to the system is from
> a queue.
>
> Are the events discrete or continuous? (This is kinda important.)
>
> If the events are continuous then more than likely you’re going to be
> ingesting data where the key is somewhat sequential. If you use put(), you
> end up with hot spotting. And you’ll end up with regions half full.
> So you would be better off batching up the data and doing bulk imports.
>
> If the events are discrete, then you’ll want to use put() because the odds
> are you will not be using a sequential key. (You could, but I’d suggest
> that you rethink your primary key)
>
> Depending on the rate of ingestion, you may want to do a manual flush. (It
> depends on the velocity of data to be ingested and your use case )
> (Remember what caching occurs and where when dealing with HBase.)
>
> A third option… Depending on how you use the data, you may want to avoid
> storing the data in HBase, and only use HBase as an index to where you
> store the data files for quick access.  Again it depends on your data
> ingestion flow and how you intend to use the data.
>
> So really this is less a spark issue than an HBase issue when it comes to
> design.
>
> HTH
>
> -Mike
>
> > On Jul 15, 2015, at 11:46 AM, Shushant Arora <sh...@gmail.com>
> wrote:
> >
> > Hi
> >
> > I have a requirement of writing in hbase table from Spark streaming app
> after some processing.
> > Is Hbase put operation the only way of writing to hbase or is there any
> specialised connector or rdd of spark for hbase write.
> >
> > Should Bulk load to hbase from streaming  app be avoided if output of
> each batch interval is just few mbs?
> >
> > Thanks
> >
>
> The opinions expressed here are mine, while they may reflect a cognitive
> thought, that is purely accidental.
> Use at your own risk.
> Michael Segel
> michael_segel (AT) hotmail.com
>
>
>
>
>
>