Posted to user@hbase.apache.org by Xine Jar <xi...@googlemail.com> on 2010/01/14 19:08:22 UTC

Enhancing the processing of my hbase

Hello,
my question is more on the architecture side of my program.

*General view:*
I have a huge HBase table containing thousands of rows. Each row contains an
ID of a node and its geographical location.
A single region of the table contains approximately 10 000 rows.

*Aim*:
I would like to calculate the distance between each pair of nodes, meaning
that a task responsible for a region of 10 000 nodes needs to perform
10 000 * 10 000 reads.

*My architecture:*
I have created two scanners, A and B. Scanner A points to the source and
scanner B scans all the destination points. That is, at the beginning scanner
A points to the first row of the region and scanner B scans the rest of the
nodes. Once done, scanner A moves to the second node and B again scans all
the nodes. That is how I calculate all the pair distances.
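
In Java, the structure is roughly the following (a simplified sketch using
the standard HBase client API; the "nodes" table, the "loc:coord" column and
the distance() helper are placeholders for my actual schema):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class PairwiseDistances {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "nodes");      // placeholder table name

    // Scanner A: walks over the source rows.
    ResultScanner scannerA = table.getScanner(new Scan());
    for (Result source : scannerA) {
      // Scanner B: re-scans the nodes for every source row,
      // hence the 10 000 * 10 000 reads.
      ResultScanner scannerB = table.getScanner(new Scan());
      for (Result dest : scannerB) {
        double d = distance(
            source.getValue(Bytes.toBytes("loc"), Bytes.toBytes("coord")),
            dest.getValue(Bytes.toBytes("loc"), Bytes.toBytes("coord")));
        // ... store or emit the pair distance ...
      }
      scannerB.close();
    }
    scannerA.close();
    table.close();
  }

  // Placeholder: depends on how the geographical location is stored.
  private static double distance(byte[] a, byte[] b) {
    return 0.0;
  }
}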

*My problem:*
Scanner A was timing out because the processing takes a long time before it
moves to the next row, so I increased the scanner lease time. This helped for
a region of 1 000 nodes but not for 10 000 nodes.

*My question:*
1 - I feel that this lease value should not just keep growing because my
processing is heavy, or should it? Will a very large value have side effects?

2 - Should I instead change the structure or the idea of my program? Can
someone give me a hint on how to do this?

Thank you

Re: Enhancing the processing of my hbase

Posted by Ryan Rawson <ry...@gmail.com>.
It sounds like you are doing a cross product. You could use MapReduce to
generate the pairs and then compute your function on them, roughly along the
lines of the sketch below.
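
A rough sketch only (it assumes the TableMapper / TableMapReduceUtil classes
from the hbase mapreduce package; the "nodes" table, the "loc:coord" column
and the distance() helper are placeholders, and in a real job you would open
the second table once in setup() rather than per map call):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PairDistanceJob {

  // Each map() gets one source row and scans the table for the destinations,
  // so the outer loop of the cross product is spread across the cluster.
  static class PairMapper extends TableMapper<Text, DoubleWritable> {
    @Override
    protected void map(ImmutableBytesWritable key, Result source, Context context)
        throws IOException, InterruptedException {
      // Sketch only: open the table once in setup() in a real job.
      HTable table = new HTable(context.getConfiguration(), "nodes");
      ResultScanner destinations = table.getScanner(new Scan());
      try {
        for (Result dest : destinations) {
          double d = distance(
              source.getValue(Bytes.toBytes("loc"), Bytes.toBytes("coord")),
              dest.getValue(Bytes.toBytes("loc"), Bytes.toBytes("coord")));
          String pair = Bytes.toString(key.get()) + "," + Bytes.toString(dest.getRow());
          context.write(new Text(pair), new DoubleWritable(d));
        }
      } finally {
        destinations.close();
        table.close();
      }
    }

    // Placeholder: depends on how the geographical location is stored.
    private static double distance(byte[] a, byte[] b) {
      return 0.0;
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "pairwise-distances");
    job.setJarByClass(PairDistanceJob.class);
    TableMapReduceUtil.initTableMapperJob(
        "nodes", new Scan(), PairMapper.class, Text.class, DoubleWritable.class, job);
    job.setNumReduceTasks(0);                      // map-only job
    FileOutputFormat.setOutputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}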

On Jan 18, 2010 3:19 AM, "Xine Jar" <xi...@googlemail.com> wrote:

Thank you for your answer,
Closing scanner A sounds good. I'll try it out; I guess it will solve the
scanner timeout problem.

On Sun, Jan 17, 2010 at 10:33 PM, stack <st...@duboce.net> wrote: > On Thu,
Jan 14, 2010 at 10:08 ...

Re: Enhancing the processing of my hbase

Posted by Xine Jar <xi...@googlemail.com>.
Thank you for your answer,
Closing scanner A sounds good. I'll try it out; I guess it will solve the
scanner timeout problem.

On Sun, Jan 17, 2010 at 10:33 PM, stack <st...@duboce.net> wrote:

> On Thu, Jan 14, 2010 at 10:08 AM, Xine Jar <xi...@googlemail.com>
> wrote:
>
> > Hello,
> > my question is more on the architecture side of my program.
> >
> > *General view:*
> > I have a huge HBase table containing thousands of rows. Each row contains
> > an
> > ID of a node and its geographical location.
> > A single region of the table contains approximately 10 000 rows.
> >
> > *Aim*:
> > I would like to calculate the distance between each pair of nodes,
> > meaning that a task responsible for a region of 10 000 nodes needs to
> > perform 10 000 * 10 000 reads.
> >
>
>
> Will your data fit in memory?   You could enable the in-memory option on
> the
> column family for your table.
>
>
> >
> > *My architecture:*
> > I have created two scanners, A and B. Scanner A points to the source and
> > scanner B scans all the destination points. That is, at the beginning
> > scanner A points to the first row of the region and scanner B scans the
> > rest of the nodes. Once done, scanner A moves to the second node and B
> > again scans all the nodes. That is how I calculate all the pair distances.
> >
>
> Good.
>
>
>
> >
> > *My problem:*
> > Scanner A was timing out because the processing takes a long time before
> > it moves to the next row, so I increased the scanner lease time. This
> > helped for a region of 1 000 nodes but not for 10 000 nodes.
> >
>
> So, maybe: open scanner A, scan row 1 and then row 2, and save what row 2
> is. Close the scanner. Then start scanner B processing for row 1. When
> scanner B is done, start up a new scanner A with its startrow set to row 2,
> figure out what row 3 is, close the scanner, and so on.
>
> Or open scanner A, scan 100 rows, and save them off. Run scanner B for
> these first 100 rows. When done, start scanner A again at row 101 and get
> the next 100 rows?
>
>
>
> >
> > *My question:*
> > 1 - I feel that this lease value should not just keep growing because my
> > processing is heavy, or should it? Will a very large value have side
> > effects?
>
> We need some kind of lease so that server-side resources are cleaned up.
>
> It's hard to tell the difference between a legitimate case where you want
> to keep the scanner open and a scanner that has simply lapsed.
>
> Should we add the ability to set the timeout on a scanner-by-scanner basis?
>
> Or does the above sketch work for you, where scanner A steps through the
> region?
>
>
>
> > 2 - Should I instead change the structure or the idea of my program? Can
> > someone give me a hint on how to do this?
> >
>
> Maybe someone has a better idea here.
>
> Ideally, you'd want to run the 10k x 10k calculation inside the
> regionserver, per region. It sounds like you need something like the
> coprocessors facility that is coming down the pipe (HBASE-2001).
>
> St.Ack
>
>
>
> >
> > Thank you
> >
>

Re: Enhancing the processing of my hbase

Posted by stack <st...@duboce.net>.
On Thu, Jan 14, 2010 at 10:08 AM, Xine Jar <xi...@googlemail.com> wrote:

> Hello,
> my question is more on the architecture side of my program.
>
> *General view:*
> I have a huge HBase table containing thousands of rows. Each row contains
> an
> ID of a node and its geographical location.
> A single region of the table contains approximately 10 000 rows.
>
> *Aim*:
> I would like to calculate the distance between each pair of nodes, meaning
> that a task responsible for a region of 10 000 nodes needs to perform
> 10 000 * 10 000 reads.
>


Will your data fit in memory?   You could enable the in-memory option on the
column family for your table.
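
For example, something like this when creating the table (a sketch with the
Java admin API; the table and family names are placeholders, and the same
attribute can also be set from the hbase shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateInMemoryTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // Placeholder table/family names.  IN_MEMORY gives this family's blocks
    // higher priority in the regionserver block cache.
    HColumnDescriptor family = new HColumnDescriptor("loc");
    family.setInMemory(true);

    HTableDescriptor table = new HTableDescriptor("nodes");
    table.addFamily(family);
    admin.createTable(table);
  }
}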


>
> *My architecture:*
> I have created two scanners, A and B. Scanner A points to the source and
> scanner B scans all the destination points. That is, at the beginning
> scanner A points to the first row of the region and scanner B scans the
> rest of the nodes. Once done, scanner A moves to the second node and B
> again scans all the nodes. That is how I calculate all the pair distances.
>

Good.



>
> *My problem:*
> Scanner A was timing out because the processing takes a long time before it
> moves to the next row, so I increased the scanner lease time. This helped
> for a region of 1 000 nodes but not for 10 000 nodes.
>

So, maybe: open scanner A, scan row 1 and then row 2, and save what row 2 is.
Close the scanner. Then start scanner B processing for row 1. When scanner B
is done, start up a new scanner A with its startrow set to row 2, figure out
what row 3 is, close the scanner, and so on.

Or open scanner A, scan 100 rows, and save them off. Run scanner B for these
first 100 rows. When done, start scanner A again at row 101 and get the next
100 rows?
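
A rough sketch of that second variant, assuming the standard client API (the
"nodes" table name, the batch size, and the pair computation are placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class BatchedOuterScan {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "nodes");   // placeholder table name
    final int batchSize = 100;
    byte[] startRow = null;                     // null = start at the first row

    while (true) {
      // Re-open scanner A for each batch so it is closed (and its lease
      // released) before the slow per-row work starts.
      Scan scan = new Scan();
      if (startRow != null) {
        scan.setStartRow(startRow);
      }
      ResultScanner scannerA = table.getScanner(scan);
      List<Result> batch = new ArrayList<Result>();
      for (Result r : scannerA) {
        batch.add(r);
        if (batch.size() > batchSize) {
          break;                                // one extra row: next batch's start
        }
      }
      scannerA.close();

      if (batch.isEmpty()) {
        break;                                  // nothing left to do
      }
      if (batch.size() > batchSize) {
        startRow = batch.get(batchSize).getRow();
        batch = batch.subList(0, batchSize);    // drop the peeked extra row
      } else {
        startRow = null;                        // this was the last batch
      }

      // Scanner B: scan everything once per saved source row.
      for (Result source : batch) {
        ResultScanner scannerB = table.getScanner(new Scan());
        for (Result dest : scannerB) {
          // ... compute the pair distance between source and dest here ...
        }
        scannerB.close();
      }

      if (startRow == null) {
        break;
      }
    }
    table.close();
  }
}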



>
> *My question:*
> 1 - I feel that this lease value should not just keep growing because my
> processing is heavy, or should it? Will a very large value have side
> effects?
>

We need some kind of lease so that server-side resources are cleaned up.

It's hard to tell the difference between a legitimate case where you want to
keep the scanner open and a scanner that has simply lapsed.

Should we add the ability to set the timeout on a scanner-by-scanner basis?

Or does the above sketch work for you, where scanner A steps through the
region?



> 2 - Should I instead change the structure or the idea of my program? Can
> someone give me a hint on how to do this?
>

Maybe someone has a better idea here.

Ideally, you'd want to run the 10k x 10k calculation inside the regionserver,
per region. It sounds like you need something like the coprocessors facility
that is coming down the pipe (HBASE-2001).

St.Ack



>
> Thank you
>