Posted to user@hbase.apache.org by Manjeet Singh <ma...@gmail.com> on 2018/08/30 05:11:24 UTC

Query on rowkey distribution || Does RS and number of Region related with each other

Hi All,

I have two questions.

*Question 1:*

I want to understand how rowkeys are distributed if I create my table
without applying any split policy but use prefix salting.

For example, my rowkey has the form

SALT_ID_DayStartTimestamp_DayEndTimeStamp_IDTimeStamp

so it looks like this:

*_99_1516838400_1516924800_1516865160

The problem: I do not see my data getting distributed on the basis of the
salt alone.

So is pre-splitting my only choice, or do I have any other option?

I have seen two more approaches, i.e.

hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1

I guess its scope is limited, since the number of regions is set when the
table is created and is then fixed? Not sure.

and

*UniformSplit
<https://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/util/RegionSplitter.UniformSplit.html>*



*Question 2: Is the number of split points related in any way to the number
of RS in the cluster? If yes, what is the calculation?*

Re: Query on rowkey distribution || Does RS and number of Region related with each other

Posted by Josh Elser <el...@apache.org>.

Manjeet -- you are still missing the fact that if you do not split your 
table into multiple regions, your data will not be distributed.

Why do you think that your rowkey design means you can't split your table?
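
One way a salted rowkey and pre-splitting can work together is to line the
split points up with the salt buckets. The following is only a minimal sketch
under stated assumptions: the helper names are hypothetical, the salt is
assumed to be a zero-padded decimal bucket number, and it is derived with
ID modulo bucket count purely to keep the example deterministic (a real
design would more likely hash the ID). It prints a shell `create` statement
using the `SPLITS` option and one salted rowkey built from the values in the
original question.

```python
def salt_split_points(buckets):
    """One region per salt bucket: split at every salt value except the first."""
    width = len(str(buckets - 1))
    return [f"{b:0{width}d}" for b in range(1, buckets)]


def salted_rowkey(entity_id, day_start, day_end, ts, buckets):
    """Build SALT_ID_DayStartTimestamp_DayEndTimeStamp_IDTimeStamp."""
    width = len(str(buckets - 1))
    # Assumption: modulo on the numeric ID stands in for a real hash function.
    salt = entity_id % buckets
    return f"{salt:0{width}d}_{entity_id}_{day_start}_{day_end}_{ts}"


splits = salt_split_points(10)

# Table and column family names are taken from the thread.
print("create 'TEST_TABLE', 'si', SPLITS => " + str(splits))
print(salted_rowkey(99, 1516838400, 1516924800, 1516865160, 10))
```

With 10 buckets this yields nine split points ('1' through '9'), so each salt
value gets its own region and the key for ID 99 starts with the salt 9.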

Re: Query on rowkey distribution || Does RS and number of Region related with each other

Posted by Manjeet Singh <ma...@gmail.com>.

Hi Josh,

Sharing my steps and findings for better understanding:

I have tested the table creation approaches below (I am 100% aware of
pre-splitting but can't use it with our rowkey design).

I have to opt for a different policy that can evenly distribute the data
across all Regions.

#1
hbase org.apache.hadoop.hbase.util.RegionSplitter test_table HexStringSplit -c 10 -f f1
alter 'test_table', { NAME => 'si', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
alter 'test_table', { NAME => 'si', COMPRESSION => 'SNAPPY' }

#2
create 'TEST_TABLE_KeyPrefixRegionSplitPolicy', { NAME => 'si' }, CONFIG => { 'KeyPrefixRegionSplitPolicy.prefix_length' => '5' }
alter 'TEST_TABLE_KeyPrefixRegionSplitPolicy', { NAME => 'si', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
alter 'TEST_TABLE_KeyPrefixRegionSplitPolicy', { NAME => 'si', COMPRESSION => 'SNAPPY' }

#3 This is the one I am currently considering; I want to distribute data based only on the rowkey
create 'TEST_TABLE','si',{ NAME => 'si', COMPRESSION => 'SNAPPY' }
alter 'TEST_TABLE', { NAME => 'si', DATA_BLOCK_ENCODING => 'FAST_DIFF' }
alter 'TEST_TABLE', { NAME => 'si', COMPRESSION => 'SNAPPY' }

Thanks
Manjeet Singh



-- 
luv all

Re: Query on rowkey distribution || Does RS and number of Region related with each other

Posted by Josh Elser <el...@apache.org>.

I'd like to remind you again that we're all volunteers and we're helping 
you because we choose to do so. Antagonizing those who are helping you 
is a great way to stop receiving any free help.

If you do not create more than one Region, HBase will not distribute 
your data on more than one RegionServer. Full stop.

Re: Query on rowkey distribution || Does RS and number of Region related with each other

Posted by Manjeet Singh <ma...@gmail.com>.

Hi Elser,

I clearly talked about the rowkey. Am I talking about data? See below what I
said about the rowkey:

SALT_ID_DayStartTimestamp_DayEndTimeStamp_IDTimeStamp

The problem is that you are not understanding the question and are just
telling me what you know; even on Slack you are saying the same thing.
The question is simple: if I put a salt (which can be any arbitrary character
or a generated hash) at the beginning of the rowkey, why is my data not
getting distributed?
Please note this is not a pre-split table.

Thanks
Manjeet Singh

-- 
luv all

Re: Query on rowkey distribution || Does RS and number of Region related with each other

Posted by Josh Elser <el...@apache.org>.

As I've been trying to explain in Slack:

1. Are you including the salt in the data that you are writing, such 
that you are spreading the data across all Regions per their boundaries? 
Or, as I think you are, just creating split points with this arbitrary 
"salt" and not including it when you write data?

If, as I am assuming, you are not, all of your data will go into the 
first or last region. If you are still not getting my point, I'd suggest 
that you share the exact splitpoints and one rowkey that you are writing 
to HBase. That will make it quite clear if my guess is correct or not.

2. The number of Regions controls the number of RegionServers that will 
be involved with reads/writes against that table. This is a calculation 
that you need to figure out based on your cluster configuration and the 
magnitude of your workload.
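
Point 1 can be illustrated with a small simulation. This is only a sketch,
not HBase code: `region_index` is a hypothetical helper that mimics how a
rowkey is matched against sorted region boundaries by byte order, the nine
split points are made up (ten regions keyed on a leading decimal digit), and
the salt is taken as ID modulo 10 just to keep the output deterministic. It
compares keys that all share one literal prefix with keys whose salt varies
and is actually written as part of the key.

```python
import bisect


def region_index(split_points, rowkey):
    """Index of the region holding rowkey; region i spans [split[i-1], split[i])."""
    return bisect.bisect_right(split_points, rowkey)


def make_key(prefix, entity_id, ts):
    """Build a simplified PREFIX_ID_TIMESTAMP rowkey as bytes."""
    return f"{prefix}_{entity_id}_{ts}".encode()


# Hypothetical split points b"1" .. b"9", giving ten regions.
SPLITS = [str(d).encode() for d in range(1, 10)]

# Every row carries the same literal prefix, so every row sorts before b"1".
fixed = {region_index(SPLITS, make_key("*", i, 1516865160 + i)) for i in range(100)}

# The salt varies per row and is written into the key, so rows spread out.
salted = {region_index(SPLITS, make_key(i % 10, i, 1516865160 + i)) for i in range(100)}

# A table with no split points has one region, whatever the key looks like.
single = region_index([], make_key(7, 7, 1516865160))

print(sorted(fixed))   # [0]
print(sorted(salted))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(single)          # 0
```

In this sketch the salt only helps when it varies across rows, is part of the
written key, and the table already has region boundaries that separate the
salt values.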
