Posted to user@hbase.apache.org by Dani Rayan <da...@gmail.com> on 2011/01/25 04:26:58 UTC

reset ResultScanner cursor

Hi,

Currently I am calling table.getScanner each time to reset the cursor to
the initial row.

My code is something like this:

while (true)
{
  /*
   * I need a cursor to the first row each time.
   * Also, I tried storing the ResultScanner in a temp object to avoid
   * calling table.getScanner. It didn't work.
   */

  ResultScanner refscanner = table.getScanner(Bytes.toBytes("ColA")); // Looks expensive.
  for (Result refResult = refscanner.next(); refResult != null;
       refResult = refscanner.next()) {
     // do something
  }
}

The getScanner operation looks expensive. Am I m(i,e)ssing something?
Yes, I have RTFMed.
Any help would be appreciated.


-Thanks,
Dani
http://www.cc.gatech.edu/~iar3/

Re: reset ResultScanner cursor

Posted by Stack <st...@duboce.net>.
What are you trying to do, Dani?  There is sample code here, if that's of any
good to you: http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/package-summary.html#package_description

St.Ack


Re: Scan and addColumn and Filters

Posted by Ryan Rawson <ry...@gmail.com>.
This is the expected result (un)fortunately.

It isn't javadoced at the top level, but deep inside you can find:

http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/SingleColumnValueFilter.html

where it says:

"When using this filter on a Scan with specified inputs, the column to
be tested should also be added as input (otherwise the filter will
regard the column as missing). "

Which is a little cryptic, but more helpful than nothing.
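
To make the quoted javadoc sentence concrete, here is a small toy model in
plain Java. It is not the HBase client API: the class name, the scan() helper
and the "cf:..." column names are all made up for illustration. It assumes
that a row whose tested column is missing gets dropped, which matches what
Pete reports; the real filter has settings that affect this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy in-memory model of "select columns, then filter". NOT HBase code. */
final class ProjectionBeforeFilterDemo {

    /**
     * Returns the projected rows whose filterColumn equals wanted.
     * The filter only ever sees the columns listed in selected.
     */
    static List<Map<String, String>> scan(List<Map<String, String>> table,
                                          Set<String> selected,
                                          String filterColumn,
                                          String wanted) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> row : table) {
            // Step 1: keep only the columns that were added as scan inputs.
            Map<String, String> projected = new LinkedHashMap<>();
            for (Map.Entry<String, String> cell : row.entrySet()) {
                if (selected.contains(cell.getKey())) {
                    projected.put(cell.getKey(), cell.getValue());
                }
            }
            // Step 2: test the projected row. Assumption of this sketch:
            // a row whose tested column is missing is dropped.
            if (wanted.equals(projected.get(filterColumn))) {
                out.add(projected);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> table = List.of(
            Map.of("cf:title", "a", "cf:status", "live"),
            Map.of("cf:title", "b", "cf:status", "dead"));

        // Filter column not selected: the filter regards it as missing.
        System.out.println(
            scan(table, Set.of("cf:title"), "cf:status", "live").size()); // prints 0

        // Filter column selected as well: the matching row comes back.
        System.out.println(
            scan(table, Set.of("cf:title", "cf:status"), "cf:status", "live").size()); // prints 1
    }
}
```

The cost of the second call is that the tested column is also shipped back
to the client, which is exactly what Pete was trying to avoid.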

I just opened https://issues.apache.org/jira/browse/HBASE-3479, if
you'd like to contribute.

-ryan


On Tue, Jan 25, 2011 at 3:11 PM, Peter Haidinyak <ph...@local.com> wrote:
> Hi All,
>   I finally figured out what was happening with my scans not bringing back any results when using some filters. It turns out if I don't add the column, via scan.addColumn(), for the column I am filtering for the scan will not return data. I was trying to reduce the amount of data being returned from a scan so I didn't add any columns I didn't need for display, they are just used to filter results.
>        Is this an expected result? The javadocs didn't mention anything about it (that I could find).
>
> Thanks
>
> -Pete
>

Scan and addColumn and Filters

Posted by Peter Haidinyak <ph...@local.com>.
Hi All,
   I finally figured out what was happening with my scans not bringing back any results when using some filters. It turns out that if I don't add the column I am filtering on via scan.addColumn(), the scan will not return data. I was trying to reduce the amount of data returned from a scan, so I didn't add any columns that I didn't need for display; they are only used to filter results.
   Is this an expected result? The javadocs didn't mention anything about it (that I could find).

Thanks

-Pete

Re: reset ResultScanner cursor

Posted by Dani Rayan <da...@gmail.com>.
Hey tsuna. I changed the algorithm significantly and eliminated the "nested"
loop, and it now works lightning fast. I do the scans separately instead of
nesting them.

Anyway, I have kept the old code so I can revisit it later and find out why
nested scans perform poorly (perhaps only on a single machine in
pseudo-distributed mode).
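
The thread does not say what the new algorithm looks like. One plausible way
to turn a nested scan into two separate scans is sketched below in plain
Java, with an in-memory list standing in for the table. The Row class and the
tagA/tagB fields are hypothetical stand-ins for the colFamA and colFamB data,
and the sketch assumes the tag index fits in client memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy comparison of a nested scan and two separate scans. NOT HBase code. */
final class SeparateScansDemo {

    /** Hypothetical row: a key, a tag derived from colFamA, a tag stored in colFamB. */
    static final class Row {
        final String key, tagA, tagB;
        Row(String key, String tagA, String tagB) {
            this.key = key;
            this.tagA = tagA;
            this.tagB = tagB;
        }
    }

    /** Nested version: one full inner scan for every outer row. */
    static List<String> nested(List<Row> table) {
        List<String> pairs = new ArrayList<>();
        for (Row outer : table) {
            for (Row inner : table) { // stands in for opening a fresh scanner
                if (outer.tagA.equals(inner.tagB)) {
                    pairs.add(outer.key + "->" + inner.key);
                }
            }
        }
        return pairs;
    }

    /** Two separate passes: index the colFamB tags once, then look them up. */
    static List<String> separate(List<Row> table) {
        Map<String, List<String>> keysByTagB = new HashMap<>();
        for (Row r : table) { // scan 1: build the index
            keysByTagB.computeIfAbsent(r.tagB, t -> new ArrayList<>()).add(r.key);
        }
        List<String> pairs = new ArrayList<>();
        for (Row r : table) { // scan 2: match each row against the index
            for (String match : keysByTagB.getOrDefault(r.tagA, List.of())) {
                pairs.add(r.key + "->" + match);
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Row> table = List.of(
            new Row("r1", "x", "y"),
            new Row("r2", "y", "x"),
            new Row("r3", "x", "x"));
        // Both versions find the same pairs, in the same order.
        System.out.println(nested(table).equals(separate(table))); // prints true
    }
}
```

The nested version reads the table once per outer row, while the two-pass
version reads it twice in total, at the price of holding the index in memory.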

> Does your table fit entirely in one region?  How big are the rows?
> Are you writing a lot to your table?  Are you typically inserting
> cells or overwriting stuff in existing ones?

No, it doesn't. It has spawned several regions.
The rows are sparse: sometimes as large as a whole web page stored in a
column, and sometimes very small, just metadata.
Yes! I do overwrite entire rows often (after the proof of concept, this
won't happen).

> Is your pseudo-distributed HBase running on a single machine?  If yes,
> why not use a non-distributed HBase setup (without HDFS)?

Yes, it is running on a single machine.
Good suggestion. I should set that up separately.

-Thanks,
Dani

On Tue, Jan 25, 2011 at 11:41 PM, tsuna <ts...@gmail.com> wrote:

> On Tue, Jan 25, 2011 at 2:14 PM, Dani Rayan <da...@gmail.com> wrote:
> > But opening and closing the scanner inside this nested loop is taking
> > multiple seconds to complete on just 3000 rows :(
>
> Something is wrong with your cluster or the way you use it.  The
> overhead of opening / closing the scanner is normally absolutely
> negligible compared to the overhead to scan the full table, even with
> a table as small as just 3000 rows.
>
> Does your table fit entirely in one region?  How big are the rows?
> Are you writing a lot to your table?  Are you typically inserting
> cells or overwriting stuff in existing ones?
>
> Is your pseudo-distributed HBase running on a single machine?  If yes,
> why not use a non-distributed HBase setup (without HDFS)?
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
>

Re: reset ResultScanner cursor

Posted by tsuna <ts...@gmail.com>.
On Tue, Jan 25, 2011 at 2:14 PM, Dani Rayan <da...@gmail.com> wrote:
> But opening and closing the scanner inside this nested loop is taking
> multiple seconds to complete on just 3000 rows :(

Something is wrong with your cluster or the way you use it.  The
overhead of opening / closing the scanner is normally absolutely
negligible compared to the overhead to scan the full table, even with
a table as small as just 3000 rows.

Does your table fit entirely in one region?  How big are the rows?
Are you writing a lot to your table?  Are you typically inserting
cells or overwriting stuff in existing ones?

Is your pseudo-distributed HBase running on a single machine?  If yes,
why not use a non-distributed HBase setup (without HDFS)?

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com

Re: reset ResultScanner cursor

Posted by Dani Rayan <da...@gmail.com>.
Thanks, tsuna! I have multiple scanners operating on the same table, and
rows are being inserted all the time (lock contention is possible).

Moreover, this is in the dev phase, and I'm running in pseudo-distributed
mode with plenty of RAM (16GB).

> The current RPC protocol forces you to make this call.  You can't seek
> back with a scanner.  When you move forward, the only way to go back
> is to close the scanner and open a new one again.

Stack, I want to reset the scanner to its initial position after a complete
full scan, to avoid closing it and opening a new one. Yes, I have seen that
link. Thanks for pointing it out again.

In a nutshell, I want to:
while (true)
{
   scan_all_rows_in_tableT_colFamA
   {
      do_something_on_each_row_to_find_a_tagX;

      // right now, opening a new scanner each time for the loop below
      // getScanner(tableT, colFamB)
      scan_all_rows_in_tableT_which_has_colFamB:tagX
      {
         // do_something
      }
      // closing scanner
   }
}

Since I am operating only on a single tableT for this logic, I want to try
without MR jobs.
But opening and closing the scanner inside this nested loop is taking
multiple seconds to complete on just 3000 rows :(
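
A rough back-of-the-envelope estimate of what the loop above costs, written
as a small Java sketch. It assumes one inner scan per outer row and that
every inner scan reads the whole table; both are readings of the pseudo-code,
not measured facts. The 2 ms per open is the figure tsuna quotes for a
production cluster elsewhere in this thread. The class and method names are
made up.

```java
/** Rough cost model for one iteration of the outer while loop. A sketch only. */
final class NestedScanCost {

    /** One scanner for the outer scan plus one per outer row. */
    static long nestedOpens(long rows) {
        return 1 + rows;
    }

    /** Rows read by the outer scan plus a full inner scan per outer row. */
    static long nestedRowReads(long rows) {
        return rows + rows * rows;
    }

    /** Total time spent only on opening scanners. */
    static long openOverheadMillis(long opens, long millisPerOpen) {
        return opens * millisPerOpen;
    }

    public static void main(String[] args) {
        long rows = 3000;
        long opens = nestedOpens(rows);
        System.out.println(opens);                        // prints 3001
        System.out.println(openOverheadMillis(opens, 2)); // prints 6002
        System.out.println(nestedRowReads(rows));         // prints 9003000
    }
}
```

Under these assumptions, about six seconds go into opening scanners alone,
and roughly nine million rows are read to process a 3000-row table.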

-Thanks,
Dani.

On Mon, Jan 24, 2011 at 11:30 PM, tsuna <ts...@gmail.com> wrote:

> On Mon, Jan 24, 2011 at 7:26 PM, Dani Rayan <da...@gmail.com> wrote:
> >  ResultScanner refscanner = table.getScanner(Bytes.toBytes("ColA")); //
> > Looks expensive.
>
> > The getscanner operation looks expensive. Am I m(i,e)ssing something ?
>
> This shouldn't be expensive.  What happens under the hood is that the
> client makes an "openScanner" RPC call to the RegionServer, to which
> the RS responds with a scanner ID.  The state of the scanner is stored
> in the RS.
>
> The current RPC protocol forces you to make this call.  You can't seek
> back with a scanner.  When you move forward, the only way to go back
> is to close the scanner and open a new one again.
>
> Opening a scanner shouldn't take long, we're talking about
> milliseconds (I'm seeing ~2ms in one of our production clusters at
> StumbleUpon).  Are your RegionServers very busy?  Have you seen
> anything that might look like excessive GCing or lock contention?
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
>

Re: reset ResultScanner cursor

Posted by tsuna <ts...@gmail.com>.
On Mon, Jan 24, 2011 at 7:26 PM, Dani Rayan <da...@gmail.com> wrote:
>  ResultScanner refscanner = table.getScanner(Bytes.toBytes("ColA")); //
> Looks expensive.

> The getscanner operation looks expensive. Am I m(i,e)ssing something ?

This shouldn't be expensive.  What happens under the hood is that the
client makes an "openScanner" RPC call to the RegionServer, to which
the RS responds with a scanner ID.  The state of the scanner is stored
in the RS.

The current RPC protocol forces you to make this call.  You can't seek
back with a scanner.  When you move forward, the only way to go back
is to close the scanner and open a new one again.
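
The scanner life cycle described above can be pictured with a toy model in
plain Java. This is not HBase code: ToyRegionServer and its methods are
invented for illustration, and the only things taken from the description
are that opening returns a scanner ID, that the position lives on the server
side, and that there is no call to move backwards.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a server that holds forward-only scanner state. NOT HBase code. */
final class ToyRegionServer {
    private final List<String> rows;
    // Scanner ID -> index of the next row to return; kept on the "server".
    private final Map<Long, Integer> positions = new HashMap<>();
    private long nextId = 0;

    ToyRegionServer(List<String> rows) {
        this.rows = rows;
    }

    /** Models the "openScanner" call: the server hands back a scanner ID. */
    long openScanner() {
        long id = nextId++;
        positions.put(id, 0);
        return id;
    }

    /** Returns the next row, or null when the scan is exhausted. */
    String next(long id) {
        Integer pos = positions.get(id);
        if (pos == null) {
            throw new IllegalStateException("unknown or closed scanner " + id);
        }
        if (pos >= rows.size()) {
            return null;
        }
        positions.put(id, pos + 1);
        return rows.get(pos);
    }

    /** Drops the server-side state for this scanner. */
    void close(long id) {
        positions.remove(id);
    }

    public static void main(String[] args) {
        ToyRegionServer server = new ToyRegionServer(List.of("row1", "row2"));
        long id = server.openScanner();
        while (server.next(id) != null) {
            // first pass over all rows
        }
        // There is no seek-back call, so "rewinding" is close + open.
        server.close(id);
        id = server.openScanner();
        System.out.println(server.next(id)); // prints row1
        server.close(id);
    }
}
```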

Opening a scanner shouldn't take long, we're talking about
milliseconds (I'm seeing ~2ms in one of our production clusters at
StumbleUpon).  Are your RegionServers very busy?  Have you seen
anything that might look like excessive GCing or lock contention?

-- 
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com