Posted to user@hbase.apache.org by Dan Han <da...@gmail.com> on 2012/09/26 02:25:34 UTC

Distribution of regions to servers

Hi all,

   I am running some experiments on HBase with coprocessors. I found that
coprocessor performance is strongly affected by how the regions are
distributed. I would like to dig deeper into this problem and see whether
I can do something about it.

  The only discussion I could find is at the following link:
http://search-hadoop.com/m/Vjhgj1lqw7Y1/hbase+distribution+region&subj=distribution+of+regions+to+servers

Is there any further discussion or ongoing work on this? If so, could
someone point me to it?
Thanks in advance.

Best Wishes
Dan Han

RE: Distribution of regions to servers

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.

Hi Dan

Generally, if regions are not distributed according to the workload, some
region servers end up overloaded because of region hotspotting.

Write throughput can go down as well, so it is not only the coprocessor
performance that suffers.

Please check whether the regions are properly balanced. If you are using
0.92 or above, you can use the options to balance by table or by region
server.
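
To illustrate the difference between the two ways of balancing, here is a minimal Python sketch (not HBase code). The region-to-server assignment, the table names and the server names are all made up for the example. It only shows that a cluster can look balanced overall while each table is concentrated on a few servers.

```python
from collections import Counter

# Hypothetical assignment: (table, region_id) -> region server.
# None of these names come from a real cluster; they are illustration only.
assignment = {
    ("schema1", "r1"): "rs1",
    ("schema1", "r2"): "rs1",
    ("schema1", "r3"): "rs1",
    ("schema1", "r4"): "rs2",
    ("schema2", "r1"): "rs2",
    ("schema2", "r2"): "rs2",
    ("schema2", "r3"): "rs3",
    ("schema2", "r4"): "rs3",
    ("schema2", "r5"): "rs3",
}
servers = ["rs1", "rs2", "rs3"]


def regions_per_server(assignment, servers, table=None):
    """Count regions hosted by each server, optionally for one table only."""
    counts = Counter({s: 0 for s in servers})
    for (tbl, _region), server in assignment.items():
        if table is None or tbl == table:
            counts[server] += 1
    return dict(counts)


def skew(counts):
    """Difference between the most and the least loaded server."""
    return max(counts.values()) - min(counts.values())


overall = regions_per_server(assignment, servers)
by_table = {
    t: regions_per_server(assignment, servers, t) for t in ("schema1", "schema2")
}

print("overall:", overall, "skew:", skew(overall))
for table, counts in by_table.items():
    print(table, counts, "skew:", skew(counts))
```

In this made-up layout every server hosts three regions, yet a query that touches only schema1 lands almost entirely on rs1. That is the situation where per-table balance matters more than the overall region count.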

Regards
Ram

Re: Distribution of regions to servers

Posted by Dan Han <da...@gmail.com>.

Thanks for your advice, Eugeny.

Best Wishes
Dan Han

Re: Distribution of regions to servers

Posted by Eugeny Morozov <em...@griddynamics.com>.

Dan, see my comments inline.

On Thu, Sep 27, 2012 at 5:30 AM, Dan Han <da...@gmail.com> wrote:

> Hi, Eugeny ,
>
>    Thanks for your response. I answered your questions inline in Blue.
> And I'd like to give an example to describe my problem.
>
> Let's think about two data schemas for the same dataset.
> The two data schemas have different composite row keys.


Just a first idea: if you have different schemas, it would be much simpler
to have two different tables, one per schema, because in that case HBase
itself automatically distributes each table's regions evenly across the
cluster. You could actually use the same coprocessor for both tables.

If you're using two different column families, you could specify a
different BLOCKSIZE for each (the default value is '65536'). You could make
the block sizes differ by a factor of 10 between the CFs (matching the
difference between your schemas). I believe this would decrease the number
of reads for the larger data chunks.
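
A rough back-of-the-envelope check of that idea, as a Python sketch rather than anything HBase-specific. The 1KB and 10KB row sizes are the ones from Dan's example; the helper names are made up, and the model ignores key, index and compression overhead, so the numbers are only indicative.

```python
DEFAULT_BLOCKSIZE = 65536  # bytes, the default BLOCKSIZE mentioned above


def rows_per_block(block_size, row_size):
    """Approximate number of whole rows that fit in one block."""
    return block_size // row_size


def blocks_to_read(num_rows, block_size, row_size):
    """Approximate number of blocks touched when reading num_rows consecutive rows."""
    per_block = max(1, rows_per_block(block_size, row_size))
    return -(-num_rows // per_block)  # ceiling division


small_row = 1 * 1024   # 1KB rows (1st schema)
large_row = 10 * 1024  # 10KB rows (2nd schema)

cases = [
    ("1KB rows, default block", small_row, DEFAULT_BLOCKSIZE),
    ("10KB rows, default block", large_row, DEFAULT_BLOCKSIZE),
    ("10KB rows, 10x block", large_row, 10 * DEFAULT_BLOCKSIZE),
]
for label, row_size, block_size in cases:
    print(
        label,
        "rows/block:", rows_per_block(block_size, row_size),
        "blocks for 1000 rows:", blocks_to_read(1000, block_size, row_size),
    )
```

Under these simplifying assumptions, a 10x larger block for the 10KB rows brings the number of blocks read per 1000 rows back to what the 1KB rows need with the default block size.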

In general it is not good to have two (or more) column families that differ
greatly in size, because flushing and compaction are done per region, which
means that if HBase starts compacting the small column family it will do the
same for the big one.
http://hbase.apache.org/book.html#number.of.cfs

BTW, I don't think coprocessors are a good choice for data mining, because
it is kind of dangerous. Coprocessors are server-side creatures (they live
in the Region Server), so they could simply bring the whole system down.
Expensive analysis creates heap and CPU pressure, which in turn leads to GC
pauses and even more CPU pressure.

Consider using Pig with HBaseStorage to load data from HBase.

> But there is
> a same part in both schemas, which represents a sequence ID.
> In 1st schema, one row contains 1KB information;
> while in 2nd schema, one row contains 10KB information.
> So the number of rows in one region in 1st schema is more than
> that in 2nd schema, right? If the queried data is based on the sequence ID,
> as one region in 1st schema is responsible for more number of rows than
> that in 2nd schema,
> there would be more computation and long execution time for the
> corresponding coprocessor.
> So in this case, if the regions are not distributed well,
> some region servers will suffer in excess workload.
> That is why I want to do some management of regions to get better load
> balance based on large queries.
>
> Hope it makes sense to you.
>
> Best Wishes
> Dan Han
>
>
> On Wed, Sep 26, 2012 at 3:19 PM, Eugeny Morozov
> <em...@griddynamics.com> wrote:
>
> > Dan,
> >
> > I have additional questions.
> > What is the access pattern of your queries? I mean that f.e.
> PrefixFilters
> > have to be applied for all KeyValue pairs in HFiles, which could be slow.
> > Or f.e. scanner setCaching option is able to decrease number of network
> > hops to get data from RegionServer.
> >
>
>     I set the range of the rows and the related columns to narrow down the
> scan scope,
>     and I used PrefixFilter/ColumnFilter/BinaryFilter to get the rows.
>     I set a little cache (5KB), but I kept it the same for all evaluated
> data schema.
>     Because I mainly focus on evaluate the performance of queries under the
> different data schemas.
>
>
> > Additionally, coprocessors are able to use InternalScanner instead of
> > ResultScanner, which is also could help greatly.
> >
>
>     yes, I used InternalScanner.
>
> >
> > Also, the more dimension you specify, the more precise your query is, the
> > less data is about to be processed - family, columns, timeranges, etc.
> >
> >
> > On Wed, Sep 26, 2012 at 7:39 PM, Dan Han <da...@gmail.com> wrote:
> >
> > >   Thanks for your swift response, Ramkrishna and Anoop. And I will
> > > explicate what we are doing now below.
> > >
> > >    We are trying to explore a systematic way to design the appropriate
> > data
> > > schema for various applications in HBase. So we first designed several
> > data
> > > schemas for each dataset and evaluate them with the same queries.  The
> > > queries are designed based on the requirements, such as selecting the
> > data
> > > with a matching expression, finding the difference between two
> > > snapshots. The queries were processed with user-level Coprocessor.
> > >
> > >    In our experiments, we found that under some data schemas, the
> queries
> > > cannot get any results because of the connection timeout and RS crash
> > > sometimes. We observed that in this case, the queried data were
> centered
> > in
> > > a few regions locating in a few region servers. We think the failure
> > might
> > > be caused by the excess workload in these few region servers and the
> > > inappropriate load balance. To our best knowledge, this case can be
> > avoided
> > > and improved by the well-distributed regions across the region servers.
> > >
> > >   Therefore, we have been thinking to add a monitoring and management
> > > component between the client and server, which can schedule the
> > > queries/jobs from client side and distribute the regions dynamically
> > > according to the current workload of each region server, the incoming
> > > queries and data locality.
> > >
> > >   Does it make sense? Just my two cents. Any comments?
> > >
> > > Best Wishes
> > > Dan Han
> > >
> > > On Tue, Sep 25, 2012 at 10:44 PM, Anoop Sam John <an...@huawei.com>
> > > wrote:
> > >
> > > > Hi
> > > > Can u share more details pls? What work you are doing within the CPs
> > > >
> > > > -Anoop-
> > > > ________________________________________
> > > > From: Dan Han [dannahan2008@gmail.com]
> > > > Sent: Wednesday, September 26, 2012 5:55 AM
> > > > To: user@hbase.apache.org
> > > > Subject: Distribution of regions to servers
> > > >
> > > > Hi all,
> > > >
> > > >    I am doing some experiments on HBase with Coprocessor. I found
> that
> > > the
> > > > performance
> > > > of Coprocessor is impacted much by the distribution of the regions. I
> > am
> > > > kind of interested in
> > > > going deep into this problem and see if I can do something.
> > > >
> > > >   I only searched out the discussion in the following link.
> > > >
> > > >
> > >
> >
> http://search-hadoop.com/m/Vjhgj1lqw7Y1/hbase+distribution+region&subj=distribution+of+regions+to+servers
> > > >
> > > > I am wondering if there is any further discussion or any on-going
> work?
> > > Can
> > > > someone point it to me if there is?
> > > > Thanks in advance.
> > > >
> > > > Best Wishes
> > > > Dan Han
> > > >
> > >
> >
> >
> >
> > --
> > Evgeny Morozov
> > Developer Grid Dynamics
> > Skype: morozov.evgeny
> > www.griddynamics.com
> > emorozov@griddynamics.com
> >
>



-- 
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emorozov@griddynamics.com

RE: Distribution of regions to servers

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.

Just thinking out loud here:

Is it possible for you to collocate the regions of the 1st schema with the
regions of the 2nd schema, so that the whole query executes on a single RS
and there is not much IO?
Also, when you go with a coprocessor on collocated regions, the caching and
RPC timeout need to be set accordingly.

Regards
Ram

RE: Distribution of regions to servers

Posted by "Ramkrishna.S.Vasudevan" <ra...@huawei.com>.

Hi Dan

I am not sure whether my answer was in fact relevant to your problem.
Anyway, let me try to answer the question about the 'region being redundant'.
No two regions can be responsible for the same range of data in one table.
That is why, if a region is unavailable, that portion of the data is
unavailable to clients.
"when you go with a coprocessor on collocated regions, the caching and RPC
timeout need to be set accordingly."
What I meant was that every scan will now hit two regions, and in your use
case one of them is going to be dense while the other will return quickly.
So we may need to make sure that the overall scan does not time out.
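
A rough way to reason about this, as a Python sketch. All the numbers are assumptions for illustration: a 60000 ms RPC timeout, 50 ms of coprocessor work per row in the dense region and 0.5 ms per row in the other one. The helper names are made up too. The idea is simply that the time to fill one batch of cached rows has to stay well below the RPC timeout.

```python
def batch_time_ms(caching_rows, per_row_ms):
    """Rough server-side time to fill one batch of caching_rows rows."""
    return caching_rows * per_row_ms


def max_safe_caching(rpc_timeout_ms, per_row_ms, safety=0.5):
    """Largest rows-per-batch value that stays under safety * rpc_timeout_ms."""
    return int(rpc_timeout_ms * safety // per_row_ms)


RPC_TIMEOUT_MS = 60000   # assumed timeout, for illustration only
DENSE_ROW_MS = 50        # assumed per-row cost in the dense region
SPARSE_ROW_MS = 0.5      # assumed per-row cost in the quick region

caching = 1000
print("dense batch (ms):", batch_time_ms(caching, DENSE_ROW_MS))
print("sparse batch (ms):", batch_time_ms(caching, SPARSE_ROW_MS))
print("max caching, dense:", max_safe_caching(RPC_TIMEOUT_MS, DENSE_ROW_MS))
print("max caching, sparse:", max_safe_caching(RPC_TIMEOUT_MS, SPARSE_ROW_MS))
```

With these assumed costs, a caching value that is harmless for the quick region comes close to the timeout on the dense one, so the caching value and the timeout have to be chosen for the slower of the two regions.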

Regards
Ram


> -----Original Message-----
> From: Dan Han [mailto:dannahan2008@gmail.com]
> Sent: Friday, September 28, 2012 3:05 AM
> To: user@hbase.apache.org
> Subject: Re: Distribution of regions to servers
> 
> Hi Ramkrishna,
> 
>   I think relocating regions is based on the queries and queried data.
> The relocation can scatter the regions involved in the query across
> region
> servers
> which might enable large queries get better load balance.
> For small queries, distribution of regions can also impact the
> throughput.
> 
> To this point, I actually have a question here: can the region
> be redundant?
> For example, there are two regions which are responsible for the same
> range
> of data?
> 
> I don't quite understand this: "when you go with coprocessor on
> a collocated regions, the caching and
> rpc timeout needs to be set accordingly."
> Could you please explain it further? Thanks in advance.
> 
> Best Wishes
> Dan Han
> 
> 

Re: Distribution of regions to servers

Posted by Dan Han <da...@gmail.com>.

Ramkrishna, I see what you mean. Thanks very much for your reply.

Best Wishes
Dan Han

On Thu, Sep 27, 2012 at 10:21 PM, Ramkrishna.S.Vasudevan <
ramkrishna.vasudevan@huawei.com> wrote:

> Hi Dan
>
> Am not very sure whether my answer was infact relevant to your problem.
> Any way I can try answering about the 'region being redundant'?
> No two regions can be responsible for the same range of data in one table.
> That is why if any region is not available that portion of data is not
> available to the clients.
>
> "when you go with coprocessor on  a collocated regions, the caching and
>  rpc
> timeout needs to be set accordingly."
> What I meant here was now every scan will hit two regions and as per your
> use case one is going to be dense and other one will return quickly.
> May be we may need to see that the overall scan is not timeout.
>
> Regards
> Ram
>
>
> > -----Original Message-----
> > From: Dan Han [mailto:dannahan2008@gmail.com]
> > Sent: Friday, September 28, 2012 3:05 AM
> > To: user@hbase.apache.org
> > Subject: Re: Distribution of regions to servers
> >
> > Hi Ramkrishna,
> >
> >   I think the relocation of regions would be based on the queries and
> > the queried data. Relocation can scatter the regions involved in a query
> > across region servers, which might give large queries better load balance.
> > For small queries, the distribution of regions can also affect throughput.
> >
> > On this point, I have a question: can a region be redundant?
> > For example, can two regions be responsible for the same range of data?
> >
> > I don't quite understand this: "when you go with coprocessor on
> > a collocated regions, the caching and
> > rpc timeout needs to be set accordingly."
> > Could you please explain it further? Thanks in advance.
> >
> > Best Wishes
> > Dan Han
> >

Re: Distribution of regions to servers

Posted by Dan Han <da...@gmail.com>.

Hi Ramkrishna,

  I think the relocation of regions would be based on the queries and
the queried data. Relocation can scatter the regions involved in a query
across region servers, which might give large queries better load balance.
For small queries, the distribution of regions can also affect throughput.

On this point, I have a question: can a region be redundant?
For example, can two regions be responsible for the same range of data?

I don't quite understand this: "when you go with coprocessor on
a collocated regions, the caching and
rpc timeout needs to be set accordingly."
Could you please explain it further? Thanks in advance.

Best Wishes
Dan Han


On Wed, Sep 26, 2012 at 10:49 PM, Ramkrishna.S.Vasudevan <
ramkrishna.vasudevan@huawei.com> wrote:

> Just trying out here,
>
> Is it possible for you to collocate the region of the 1st schema and the
> region of the 2nd schema, so that the whole query executes on a
> single RS and there is not much IO?
> Also when you go with coprocessor on a collocated regions, the caching and
> rpc timeout needs to be set accordingly.
>
> Regards
> Ram
> > -----Original Message-----
> > From: Dan Han [mailto:dannahan2008@gmail.com]
> > Sent: Thursday, September 27, 2012 7:00 AM
> > To: user@hbase.apache.org
> > Subject: Re: Distribution of regions to servers
> >
> > Hi Eugeny,
> >
> >    Thanks for your response. I have answered your questions inline below,
> > and I'd like to give an example to describe my problem.
> >
> > Consider two data schemas for the same dataset.
> > The two schemas have different composite row keys, but both keys share
> > a common part that represents a sequence ID.
> > In the 1st schema, one row contains 1KB of information,
> > while in the 2nd schema, one row contains 10KB of information.
> > So one region in the 1st schema holds more rows than
> > one in the 2nd schema, right? If a query selects data by sequence ID,
> > then, because a region in the 1st schema is responsible for more rows,
> > the corresponding coprocessor has more computation to do and takes
> > longer to execute.
> > So in this case, if the regions are not distributed well,
> > some region servers will suffer from excess workload.
> > That is why I want to manage the regions to get better load
> > balance for large queries.
> >
> > Hope it makes sense to you.
> >
> > Best Wishes
> > Dan Han

Re: Distribution of regions to servers

Posted by Dan Han <da...@gmail.com>.

Hi Eugeny,

   Thanks for your response. I have answered your questions inline below,
and I'd like to give an example to describe my problem.

Consider two data schemas for the same dataset.
The two schemas have different composite row keys, but both keys share
a common part that represents a sequence ID.
In the 1st schema, one row contains 1KB of information,
while in the 2nd schema, one row contains 10KB of information.
So one region in the 1st schema holds more rows than
one in the 2nd schema, right? If a query selects data by sequence ID,
then, because a region in the 1st schema is responsible for more rows,
the corresponding coprocessor has more computation to do and takes
longer to execute.
So in this case, if the regions are not distributed well,
some region servers will suffer from excess workload.
That is why I want to manage the regions to get better load
balance for large queries.
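
To make the row-count difference concrete, here is a minimal Java sketch of
the arithmetic. It assumes a fixed region size of 256 MB, which is a
hypothetical figure and not one taken from this thread, and it ignores row
key, column and storage overhead. The class and method names are made up for
illustration.

```java
// Illustrative only: region size and row sizes are assumed values.
final class RowsPerRegion {

    /** Approximate number of rows of rowBytes each that fit in regionBytes. */
    static long rowsPerRegion(long regionBytes, long rowBytes) {
        if (regionBytes < 0 || rowBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return regionBytes / rowBytes;
    }

    public static void main(String[] args) {
        long regionBytes = 256L * 1024 * 1024;                 // assumed region size
        long schema1 = rowsPerRegion(regionBytes, 1024);       // 1KB rows
        long schema2 = rowsPerRegion(regionBytes, 10 * 1024);  // 10KB rows
        System.out.println("1st schema, rows per region: " + schema1);
        System.out.println("2nd schema, rows per region: " + schema2);
    }
}
```

Under these assumptions a region in the 1st schema holds roughly ten times as
many rows, so a coprocessor that does per-row work has roughly ten times as
many rows to visit in each region it touches.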

Hope it makes sense to you.

Best Wishes
Dan Han


On Wed, Sep 26, 2012 at 3:19 PM, Eugeny Morozov
<em...@griddynamics.com>wrote:

> Dan,
>
> I have additional questions.
> What is the access pattern of your queries? For example, a PrefixFilter
> has to be applied to all KeyValue pairs in the HFiles, which could be slow.
> Or, for example, the scanner's setCaching option can decrease the number of
> network hops needed to get data from a RegionServer.
>

    I set the row range and the relevant columns to narrow down the scan scope,
    and I used PrefixFilter/ColumnFilter/BinaryFilter to get the rows.
    I set a small cache (5KB), but I kept it the same for all evaluated
    data schemas, because I mainly focus on evaluating the performance of
    queries under the different data schemas.


> Additionally, coprocessors are able to use InternalScanner instead of
> ResultScanner, which could also help greatly.
>

    Yes, I used InternalScanner.

>
> Also, the more dimensions you specify (family, columns, time ranges, etc.),
> the more precise your query is and the less data has to be processed.

Re: Distribution of regions to servers

Posted by Eugeny Morozov <em...@griddynamics.com>.

Dan,

I have some additional questions.
What is the access pattern of your queries? For example, a PrefixFilter
has to be applied to all KeyValue pairs in the HFiles, which could be slow.
Or, for example, the scanner's setCaching option can decrease the number of
network hops needed to get data from a RegionServer.

Additionally, coprocessors are able to use InternalScanner instead of
ResultScanner, which could also help greatly.

Also, the more dimensions you specify (family, columns, time ranges, etc.),
the more precise your query is and the less data has to be processed.
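
As a rough illustration of the setCaching point, the Java sketch below
estimates how many scanner round trips a client-side scan needs for a given
caching value. It is plain arithmetic and does not call HBase; it assumes, as
far as I know, that the caching value is a number of rows fetched per call
(not a size in bytes), and it ignores the calls that open and close the
scanner. The class and method names are hypothetical.

```java
// Illustrative only: models round trips as ceil(rows / caching).
final class ScanRoundTrips {

    /** Estimated number of fetch calls needed to return `rows` rows. */
    static long roundTrips(long rows, int caching) {
        if (rows < 0 || caching <= 0) {
            throw new IllegalArgumentException("rows must be >= 0 and caching > 0");
        }
        return (rows + caching - 1) / caching; // integer ceiling division
    }

    public static void main(String[] args) {
        long rows = 10_000;
        for (int caching : new int[] {1, 100, 1000}) {
            System.out.println("caching=" + caching
                    + " -> round trips=" + roundTrips(rows, caching));
        }
    }
}
```

A larger caching value lowers the number of round trips but makes each
response bigger and slower to build on the server, which is one reason it
interacts with the RPC timeout.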


On Wed, Sep 26, 2012 at 7:39 PM, Dan Han <da...@gmail.com> wrote:

>   Thanks for your swift responses, Ramkrishna and Anoop. I will
> explain what we are doing below.
>
>    We are trying to explore a systematic way to design an appropriate data
> schema for various applications in HBase. So we first designed several data
> schemas for each dataset and evaluated them with the same queries. The
> queries are designed based on the requirements, such as selecting the data
> that match an expression or finding the difference between two
> snapshots. The queries were processed with user-level Coprocessors.
>
>    In our experiments, we found that under some data schemas the queries
> sometimes cannot return any results because of connection timeouts and RS
> crashes. We observed that in these cases the queried data were concentrated
> in a few regions located on a few region servers. We think the failures
> might be caused by the excess workload on these few region servers and by
> poor load balance. To the best of our knowledge, this can be avoided
> by distributing the regions well across the region servers.
>
>   Therefore, we have been thinking of adding a monitoring and management
> component between the client and the server, which can schedule the
> queries/jobs from the client side and distribute the regions dynamically
> according to the current workload of each region server, the incoming
> queries, and data locality.
>
>   Does it make sense? Just my two cents. Any comments?
>
> Best Wishes
> Dan Han



-- 
Evgeny Morozov
Developer Grid Dynamics
Skype: morozov.evgeny
www.griddynamics.com
emorozov@griddynamics.com

Re: Distribution of regions to servers

Posted by Dan Han <da...@gmail.com>.

  Thanks for your swift responses, Ramkrishna and Anoop. I will
explain what we are doing below.

   We are trying to explore a systematic way to design an appropriate data
schema for various applications in HBase. So we first designed several data
schemas for each dataset and evaluated them with the same queries. The
queries are designed based on the requirements, such as selecting the data
that match an expression or finding the difference between two
snapshots. The queries were processed with user-level Coprocessors.

   In our experiments, we found that under some data schemas the queries
sometimes cannot return any results because of connection timeouts and RS
crashes. We observed that in these cases the queried data were concentrated
in a few regions located on a few region servers. We think the failures
might be caused by the excess workload on these few region servers and by
poor load balance. To the best of our knowledge, this can be avoided
by distributing the regions well across the region servers.

  Therefore, we have been thinking of adding a monitoring and management
component between the client and the server, which can schedule the
queries/jobs from the client side and distribute the regions dynamically
according to the current workload of each region server, the incoming
queries, and data locality.
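
One simple way to picture the "distribute regions according to workload"
idea is a greedy placement: take the regions from heaviest to lightest and
put each one on the server with the least accumulated load. The Java sketch
below is only that, a sketch under assumptions. It is not HBase's balancer,
the per-region load numbers are a hypothetical metric (for example rows
scanned by the coprocessor), it ignores data locality, and actually moving a
region would go through HBase's admin interface, which is not shown.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: greedy "heaviest region to least-loaded server" placement.
final class GreedyRegionPlacement {

    /** Returns a map from region name to the index of the chosen server. */
    static Map<String, Integer> assign(Map<String, Long> regionLoad, int servers) {
        if (servers <= 0) {
            throw new IllegalArgumentException("need at least one server");
        }
        List<Map.Entry<String, Long>> regions = new ArrayList<>(regionLoad.entrySet());
        // Heaviest first; ties broken by name so the result is deterministic.
        regions.sort((a, b) -> {
            int byLoad = Long.compare(b.getValue(), a.getValue());
            return byLoad != 0 ? byLoad : a.getKey().compareTo(b.getKey());
        });

        long[] serverLoad = new long[servers];
        Map<String, Integer> placement = new HashMap<>();
        for (Map.Entry<String, Long> region : regions) {
            // Pick the server with the smallest load so far (lowest index on ties).
            int target = 0;
            for (int s = 1; s < servers; s++) {
                if (serverLoad[s] < serverLoad[target]) {
                    target = s;
                }
            }
            serverLoad[target] += region.getValue();
            placement.put(region.getKey(), target);
        }
        return placement;
    }

    public static void main(String[] args) {
        Map<String, Long> load = new HashMap<>();
        load.put("r1", 10L);
        load.put("r2", 7L);
        load.put("r3", 5L);
        load.put("r4", 3L);
        System.out.println(assign(load, 2));
    }
}
```

With the four example regions and two servers, the heavy region r1 ends up
sharing a server only with the lightest region r4, so the two servers carry
loads of 13 and 12 instead of one server holding both hot regions.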

  Does it make sense? Just my two cents. Any comments?

Best Wishes
Dan Han

On Tue, Sep 25, 2012 at 10:44 PM, Anoop Sam John <an...@huawei.com> wrote:

> Hi
> Can you share more details, please? What work are you doing within the CPs?
>
> -Anoop-

RE: Distribution of regions to servers

Posted by Anoop Sam John <an...@huawei.com>.

Hi
Can you share more details, please? What work are you doing within the CPs?

-Anoop-
________________________________________
From: Dan Han [dannahan2008@gmail.com]
Sent: Wednesday, September 26, 2012 5:55 AM
To: user@hbase.apache.org
Subject: Distribution of regions to servers

Hi all,

   I am doing some experiments on HBase with Coprocessor. I found that the
performance
of Coprocessor is impacted much by the distribution of the regions. I am
kind of interested in
going deep into this problem and see if I can do something.

  I only searched out the discussion in the following link.
http://search-hadoop.com/m/Vjhgj1lqw7Y1/hbase+distribution+region&subj=distribution+of+regions+to+servers

I am wondering if there is any further discussion or any on-going work? Can
someone point it to me if there is?
Thanks in advance.

Best Wishes
Dan Han