Posted to user@hbase.apache.org by Geoff Hendrey <gh...@decarta.com> on 2011/09/06 04:11:16 UTC

ScannerTimeoutException

Hi -

 

I found some odd behavior with ResultScanner.next(). Usually the times
for next() are a couple hundred ms, but occasionally the call to next()
spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
once in a while the call to next() itself is interrupted by a
ScannerTimeoutException after more than 60 seconds. It seems odd that
the call to next() itself can be interrupted with "61107ms passed since
the last invocation, timeout is currently set to 60000".

The only thing I can think of is that a GC kicks in after the call to
next() begins but before it returns, while the server keeps ticking the
timeout clock. But this seems to happen with regularity, and the odds of
a GC kicking in at that exact instant, so often, seem low.
 
-geoff
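
For context on the numbers in that exception: the 60000 ms is the scanner lease period, which this generation of HBase reads from the hbase.regionserver.lease.period setting on both the client and the RegionServer. A minimal sketch of raising it, against the 0.90-era client API (the table name here is a placeholder, and the servers' hbase-site.xml must be raised to match, or the lease will still expire at the old value):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;

    // Sketch: lift the scanner lease/timeout above the 60000 ms default.
    // Both the client and the RegionServers read this key, so it must be
    // raised on both sides to take effect.
    Configuration conf = HBaseConfiguration.create();
    conf.setLong("hbase.regionserver.lease.period", 120000); // 2 minutes
    HTable table = new HTable(conf, "my_table"); // "my_table" is illustrative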

 


RE: ScannerTimeoutException

Posted by Geoff Hendrey <gh...@decarta.com>.
I'll try your suggestions!

-----Original Message-----
From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz] 
Sent: Tuesday, September 06, 2011 9:48 AM
To: user@hbase.apache.org
Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
Subject: Re: ScannerTimeoutException

Hi Geoff,

we are having these issues when the scanner uses scan.addColumn() and
the column is sparse, in the sense that there are many rows containing
only other columns in the same column family. I suppose your problems
will vanish if you use the scan.addFamily() call instead. The same
behavior may appear if you are reading from a region after a massive
delete (the timeouts settle down after a major compaction), or when
using server-side Filters.
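
A minimal sketch of that change (the "V1" family and "cluster_map" qualifier are borrowed from code later in this thread; treat the snippet as illustrative rather than a drop-in fix):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sparse: the server must skip every cell that is not exactly this
    // column before it can return anything, while the scanner lease ticks.
    Scan sparse = new Scan();
    sparse.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));

    // Denser: ask for the whole family and pick out the wanted columns on
    // the client; the server can return a row as soon as any cell of the
    // family matches.
    Scan dense = new Scan();
    dense.addFamily(Bytes.toBytes("V1"));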

Changing scan.addColumn() to scan.addFamily() brings some overhead,
which I think could be removed by the RegionServer renewing the
scanner's lease while reading data, not only on each entry to
HRegionServer.next().

Would this be worth opening a JIRA?

Jan

On 6.9.2011 04:11, Geoff Hendrey wrote:
> Hi -
>
>
>
> I found some odd behavior with ResultScanner.next(). Usually the times
> for next() are a couple hundred ms, but occasionally the call to next()
> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
> once in a while the call to next() itself is interrupted by a
> ScannerTimeoutException after more than 60 seconds. It seems odd that
> the call to next() itself can be interrupted with "61107ms passed since
> the last invocation, timeout is currently set to 60000".
>
> The only thing I can think of is that a GC kicks in after the call to
> next() begins but before it returns, while the server keeps ticking the
> timeout clock. But this seems to happen with regularity, and the odds of
> a GC kicking in at that exact instant, so often, seem low.
>
> -geoff
>
>
>


RE: ScannerTimeoutException

Posted by Geoff Hendrey <gh...@decarta.com>.
I am connecting to the table by creating the HTable in the reducer's setup method. I am not using HTablePool.

-geoff
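
For reference, the two connection patterns under discussion look roughly like this (a sketch against the 0.90-era client API; the table name and pool size are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.HTablePool;

    Configuration conf = HBaseConfiguration.create();

    // Pattern 1 (what Geoff describes): one HTable per reduce task,
    // created once in the reducer's setup() and reused for its lifetime.
    HTable table = new HTable(conf, "my_table");

    // Pattern 2 (what Sonal asks about): a shared HTablePool from which
    // table handles are borrowed and returned.
    HTablePool pool = new HTablePool(conf, 10);
    HTableInterface pooled = pool.getTable("my_table");
    // ... use the table ...
    pool.putTable(pooled); // return it to the pool (0.90-era API)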

-----Original Message-----
From: Sonal Goyal [mailto:sonalgoyal4@gmail.com] 
Sent: Thursday, September 08, 2011 8:41 AM
To: user@hbase.apache.org
Subject: Re: ScannerTimeoutException

How are you connecting to the table? Are you using the HTablePool?

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>





On Thu, Sep 8, 2011 at 8:53 PM, Geoff Hendrey <gh...@decarta.com> wrote:

> Hi Jan -
>
> The relevant code looks like this. I added addFamily. Still getting the
> ScannerTimeoutException. Arghhh. This is really becoming a mission-critical
> problem for my team... Furthermore, the problem becomes more and more
> frequent, to the point that the job almost never finishes. At the beginning
> of the job, the timeout never happens. By the end, it's happening on 9 out
> of 10 attempts...
>
>                Scan scan = new Scan(Bytes.toBytes(key));
>                scan.setCaching(1);
>                scan.setMaxVersions(1);
>                scan.addFamily(Bytes.toBytes("V1"));
>                scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
>                scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));
>
> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Tuesday, September 06, 2011 9:51 AM
> To: Jan Lukavský; user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: ScannerTimeoutException
>
> I'll try your suggestions!
>
> -----Original Message-----
> From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz]
> Sent: Tuesday, September 06, 2011 9:48 AM
> To: user@hbase.apache.org
> Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
> Subject: Re: ScannerTimeoutException
>
> Hi Geoff,
>
> we are having these issues when the scanner uses scan.addColumn() and
> the column is sparse, in the sense that there are many rows containing
> only other columns in the same column family. I suppose your problems
> will vanish if you use the scan.addFamily() call instead. The same
> behavior may appear if you are reading from a region after a massive
> delete (the timeouts settle down after a major compaction), or when
> using server-side Filters.
>
> Changing scan.addColumn() to scan.addFamily() brings some overhead,
> which I think could be removed by the RegionServer renewing the
> scanner's lease while reading data, not only on each entry to
> HRegionServer.next().
>
> Would this be worth opening a JIRA?
>
> Jan
>
> On 6.9.2011 04:11, Geoff Hendrey wrote:
> > Hi -
> >
> >
> >
> > I found some odd behavior with ResultScanner.next(). Usually the times
> > for next() are a couple hundred ms, but occasionally the call to next()
> > spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
> > once in a while the call to next() itself is interrupted by a
> > ScannerTimeoutException after more than 60 seconds. It seems odd that
> > the call to next() itself can be interrupted with "61107ms passed since
> > the last invocation, timeout is currently set to 60000".
> >
> > The only thing I can think of is that a GC kicks in after the call to
> > next() begins but before it returns, while the server keeps ticking the
> > timeout clock. But this seems to happen with regularity, and the odds of
> > a GC kicking in at that exact instant, so often, seem low.
> >
> > -geoff
> >
> >
> >
>
>

Re: ScannerTimeoutException

Posted by Sonal Goyal <so...@gmail.com>.
How are you connecting to the table? Are you using the HTablePool?

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>





On Thu, Sep 8, 2011 at 8:53 PM, Geoff Hendrey <gh...@decarta.com> wrote:

> Hi Jan -
>
> The relevant code looks like this. I added addFamily. Still getting the
> ScannerTimeoutException. Arghhh. This is really becoming a mission-critical
> problem for my team... Furthermore, the problem becomes more and more
> frequent, to the point that the job almost never finishes. At the beginning
> of the job, the timeout never happens. By the end, it's happening on 9 out
> of 10 attempts...
>
>                Scan scan = new Scan(Bytes.toBytes(key));
>                scan.setCaching(1);
>                scan.setMaxVersions(1);
>                scan.addFamily(Bytes.toBytes("V1"));
>                scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
>                scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));
>
> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Tuesday, September 06, 2011 9:51 AM
> To: Jan Lukavský; user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: ScannerTimeoutException
>
> I'll try your suggestions!
>
> -----Original Message-----
> From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz]
> Sent: Tuesday, September 06, 2011 9:48 AM
> To: user@hbase.apache.org
> Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
> Subject: Re: ScannerTimeoutException
>
> Hi Geoff,
>
> we are having these issues when the scanner uses scan.addColumn() and
> the column is sparse, in the sense that there are many rows containing
> only other columns in the same column family. I suppose your problems
> will vanish if you use the scan.addFamily() call instead. The same
> behavior may appear if you are reading from a region after a massive
> delete (the timeouts settle down after a major compaction), or when
> using server-side Filters.
>
> Changing scan.addColumn() to scan.addFamily() brings some overhead,
> which I think could be removed by the RegionServer renewing the
> scanner's lease while reading data, not only on each entry to
> HRegionServer.next().
>
> Would this be worth opening a JIRA?
>
> Jan
>
> On 6.9.2011 04:11, Geoff Hendrey wrote:
> > Hi -
> >
> >
> >
> > I found some odd behavior with ResultScanner.next(). Usually the times
> > for next() are a couple hundred ms, but occasionally the call to next()
> > spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
> > once in a while the call to next() itself is interrupted by a
> > ScannerTimeoutException after more than 60 seconds. It seems odd that
> > the call to next() itself can be interrupted with "61107ms passed since
> > the last invocation, timeout is currently set to 60000".
> >
> > The only thing I can think of is that a GC kicks in after the call to
> > next() begins but before it returns, while the server keeps ticking the
> > timeout clock. But this seems to happen with regularity, and the odds of
> > a GC kicking in at that exact instant, so often, seem low.
> >
> > -geoff
> >
> >
> >
>
>

RE: ScannerTimeoutException

Posted by Geoff Hendrey <gh...@decarta.com>.
hfile.block.cache.size is at the default (0.2). The RegionServers have a 14 GB heap.

I did not disable major compactions.

-geoff

-----Original Message-----
From: Vidhyashankar Venkataraman [mailto:vidhyash@yahoo-inc.com] 
Sent: Thursday, September 08, 2011 11:08 AM
To: user@hbase.apache.org
Cc: Rohit Nigam; Tony Wang; Parmod Mehta
Subject: Re: ScannerTimeoutException

Jira for 3: HBASE-2077

Since you are adding a column as opposed to a whole family, you are doing a sparse scan, so it is likely that issue (3) is happening.

* Have you disabled major compactions as well?
* How big is your block cache?

V


On 9/8/11 10:57 AM, "Geoff Hendrey" <gh...@decarta.com> wrote:

thanks:



1) I have splitting disabled.

2) Possible, but each reduce task uses its own JVM. Therefore, GC could explain an individual failure, but not the exponentially increasing failure rate. Further, I can concurrently start a second M/R job doing exactly the same thing, and it follows the same pattern: no failures at first… progressing to virtually all scanners timing out. Both jobs are scanning the same table.

3) We are using 0.90.1… does the bug affect us? Jira issue number?

THANKS,

-geoff



From: Vidhyashankar Venkataraman [mailto:vidhyash@yahoo-inc.com]
Sent: Thursday, September 08, 2011 10:06 AM
To: user@hbase.apache.org; Geoff Hendrey
Subject: Re: ScannerTimeoutException



There are a few reasons why this occurs:

1. The RegionServer is very busy. Splits and compactions together suck precious CPU/IO cycles.
2. GC pauses. See if you can spot any swapping (through iostat/vmstat) or any unusual pauses in the GC logs.
3. There is an existing scanner bug (which HBase version are you using?) wherein the scanner does not send keepalive messages until it sees a row that matches your scan. So the scanner is busy in the RegionServer trying to find a row that matches the scan filter while the client waits for a response.


There could be more reasons. This is what I could come up with off the cuff.
Cheers
V


On 9/8/11 8:34 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

Hi Geoff,

I believe you don't need the scan.addColumn() calls when you add the
whole family, although this should not affect the timeouts. If the
timeouts are getting more frequent, do you see compactions in your
RegionServer's log? Do the timeouts occur while scanning for the same
row(s)?

Jan

On 8.9.2011 17:23, Geoff Hendrey wrote:
> Hi Jan -
>
> The relevant code looks like this. I added addFamily. Still getting the ScannerTimeoutException. Arghhh. This is really becoming a mission-critical problem for my team... Furthermore, the problem becomes more and more frequent, to the point that the job almost never finishes. At the beginning of the job, the timeout never happens. By the end, it's happening on 9 out of 10 attempts...
>
>                  Scan scan = new Scan(Bytes.toBytes(key));
>                  scan.setCaching(1);
>                  scan.setMaxVersions(1);
>                  scan.addFamily(Bytes.toBytes("V1"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));
>
> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Tuesday, September 06, 2011 9:51 AM
> To: Jan Lukavský; user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: ScannerTimeoutException
>
> I'll try your suggestions!
>
> -----Original Message-----
> From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz]
> Sent: Tuesday, September 06, 2011 9:48 AM
> To: user@hbase.apache.org
> Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
> Subject: Re: ScannerTimeoutException
>
> Hi Geoff,
>
> we are having these issues when the scanner uses scan.addColumn() and
> the column is sparse, in the sense that there are many rows containing
> only other columns in the same column family. I suppose your problems
> will vanish if you use the scan.addFamily() call instead. The same
> behavior may appear if you are reading from a region after a massive
> delete (the timeouts settle down after a major compaction), or when
> using server-side Filters.
>
> Changing scan.addColumn() to scan.addFamily() brings some overhead,
> which I think could be removed by the RegionServer renewing the
> scanner's lease while reading data, not only on each entry to
> HRegionServer.next().
>
> Would this be worth opening a JIRA?
>
> Jan
>
> On 6.9.2011 04:11, Geoff Hendrey wrote:
>> Hi -
>>
>>
>>
>> I found some odd behavior with ResultScanner.next(). Usually the times
>> for next() are a couple hundred ms, but occasionally the call to next()
>> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
>> once in a while the call to next() itself is interrupted by a
>> ScannerTimeoutException after more than 60 seconds. It seems odd that
>> the call to next() itself can be interrupted with "61107ms passed since
>> the last invocation, timeout is currently set to 60000".
>>
>> The only thing I can think of is that a GC kicks in after the call to
>> next() begins but before it returns, while the server keeps ticking the
>> timeout clock. But this seems to happen with regularity, and the odds of
>> a GC kicking in at that exact instant, so often, seem low.
>>
>> -geoff
>>
>>
>>





Re: ScannerTimeoutException

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
Jira for 3: HBASE-2077

Since you are adding a column as opposed to a whole family, you are doing a sparse scan, so it is likely that issue (3) is happening.

* Have you disabled major compactions as well?
* How big is your block cache?

V
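
If issue (3) is in play, one possible mitigation besides widening the scan to the whole family: bound the scan so the RegionServer cannot churn far past the rows of interest, or, when a single exact row is wanted (as the setCaching(1) scan in this thread suggests), use a Get, which does not hold a scanner lease at all. A sketch under those assumptions (key, table, and family names are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "my_table");

    String key = "some-row-key"; // placeholder
    byte[] startRow = Bytes.toBytes(key);

    // Bounded scan: the exclusive stop row is the start row plus a zero
    // byte, so the scan covers exactly the one row and nothing beyond it.
    byte[] stopRow = Bytes.add(startRow, new byte[] { 0x00 });
    Scan scan = new Scan(startRow, stopRow);
    scan.addFamily(Bytes.toBytes("V1"));

    // Or skip the scanner entirely for a point lookup:
    Get get = new Get(startRow);
    get.addFamily(Bytes.toBytes("V1"));
    Result result = table.get(get);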


On 9/8/11 10:57 AM, "Geoff Hendrey" <gh...@decarta.com> wrote:

thanks:



1) I have splitting disabled.

2) Possible, but each reduce task uses its own JVM. Therefore, GC could explain an individual failure, but not the exponentially increasing failure rate. Further, I can concurrently start a second M/R job doing exactly the same thing, and it follows the same pattern: no failures at first… progressing to virtually all scanners timing out. Both jobs are scanning the same table.

3) We are using 0.90.1… does the bug affect us? Jira issue number?

THANKS,

-geoff



From: Vidhyashankar Venkataraman [mailto:vidhyash@yahoo-inc.com]
Sent: Thursday, September 08, 2011 10:06 AM
To: user@hbase.apache.org; Geoff Hendrey
Subject: Re: ScannerTimeoutException



There are a few reasons why this occurs:

1. The RegionServer is very busy. Splits and compactions together suck precious CPU/IO cycles.
2. GC pauses. See if you can spot any swapping (through iostat/vmstat) or any unusual pauses in the GC logs.
3. There is an existing scanner bug (which HBase version are you using?) wherein the scanner does not send keepalive messages until it sees a row that matches your scan. So the scanner is busy in the RegionServer trying to find a row that matches the scan filter while the client waits for a response.


There could be more reasons. This is what I could come up with off the cuff.
Cheers
V


On 9/8/11 8:34 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

Hi Geoff,

I believe you don't need the scan.addColumn() calls when you add the
whole family, although this should not affect the timeouts. If the
timeouts are getting more frequent, do you see compactions in your
RegionServer's log? Do the timeouts occur while scanning for the same
row(s)?

Jan

On 8.9.2011 17:23, Geoff Hendrey wrote:
> Hi Jan -
>
> The relevant code looks like this. I added addFamily. Still getting the ScannerTimeoutException. Arghhh. This is really becoming a mission-critical problem for my team... Furthermore, the problem becomes more and more frequent, to the point that the job almost never finishes. At the beginning of the job, the timeout never happens. By the end, it's happening on 9 out of 10 attempts...
>
>                  Scan scan = new Scan(Bytes.toBytes(key));
>                  scan.setCaching(1);
>                  scan.setMaxVersions(1);
>                  scan.addFamily(Bytes.toBytes("V1"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));
>
> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Tuesday, September 06, 2011 9:51 AM
> To: Jan Lukavský; user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: ScannerTimeoutException
>
> I'll try your suggestions!
>
> -----Original Message-----
> From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz]
> Sent: Tuesday, September 06, 2011 9:48 AM
> To: user@hbase.apache.org
> Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
> Subject: Re: ScannerTimeoutException
>
> Hi Geoff,
>
> we are having these issues when the scanner uses scan.addColumn() and
> the column is sparse, in the sense that there are many rows containing
> only other columns in the same column family. I suppose your problems
> will vanish if you use the scan.addFamily() call instead. The same
> behavior may appear if you are reading from a region after a massive
> delete (the timeouts settle down after a major compaction), or when
> using server-side Filters.
>
> Changing scan.addColumn() to scan.addFamily() brings some overhead,
> which I think could be removed by the RegionServer renewing the
> scanner's lease while reading data, not only on each entry to
> HRegionServer.next().
>
> Would this be worth opening a JIRA?
>
> Jan
>
> On 6.9.2011 04:11, Geoff Hendrey wrote:
>> Hi -
>>
>>
>>
>> I found some odd behavior with ResultScanner.next(). Usually the times
>> for next() are a couple hundred ms, but occasionally the call to next()
>> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
>> once in a while the call to next() itself is interrupted by a
>> ScannerTimeoutException after more than 60 seconds. It seems odd that
>> the call to next() itself can be interrupted with "61107ms passed since
>> the last invocation, timeout is currently set to 60000".
>>
>> The only thing I can think of is that a GC kicks in after the call to
>> next() begins but before it returns, while the server keeps ticking the
>> timeout clock. But this seems to happen with regularity, and the odds of
>> a GC kicking in at that exact instant, so often, seem low.
>>
>> -geoff
>>
>>
>>





RE: ScannerTimeoutException

Posted by Geoff Hendrey <gh...@decarta.com>.
thanks:

 

1) I have splitting disabled.

2) Possible, but each reduce task uses its own JVM. Therefore, GC could explain an individual failure, but not the exponentially increasing failure rate. Further, I can concurrently start a second M/R job doing exactly the same thing, and it follows the same pattern: no failures at first… progressing to virtually all scanners timing out. Both jobs are scanning the same table.

3) We are using 0.90.1… does the bug affect us? Jira issue number?

THANKS,

-geoff

 

From: Vidhyashankar Venkataraman [mailto:vidhyash@yahoo-inc.com] 
Sent: Thursday, September 08, 2011 10:06 AM
To: user@hbase.apache.org; Geoff Hendrey
Subject: Re: ScannerTimeoutException

 

There are a few reasons why this occurs:

1. The RegionServer is very busy. Splits and compactions together suck precious CPU/IO cycles.
2. GC pauses. See if you can spot any swapping (through iostat/vmstat) or any unusual pauses in the GC logs.
3. There is an existing scanner bug (which HBase version are you using?) wherein the scanner does not send keepalive messages until it sees a row that matches your scan. So the scanner is busy in the RegionServer trying to find a row that matches the scan filter while the client waits for a response.


There could be more reasons. This is what I could come up with off the cuff.
Cheers
V


On 9/8/11 8:34 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

Hi Geoff,

I believe you don't need the scan.addColumn() calls when you add the
whole family, although this should not affect the timeouts. If the
timeouts are getting more frequent, do you see compactions in your
RegionServer's log? Do the timeouts occur while scanning for the same
row(s)?

Jan

On 8.9.2011 17:23, Geoff Hendrey wrote:
> Hi Jan -
>
> The relevant code looks like this. I added addFamily. Still getting the ScannerTimeoutException. Arghhh. This is really becoming a mission-critical problem for my team... Furthermore, the problem becomes more and more frequent, to the point that the job almost never finishes. At the beginning of the job, the timeout never happens. By the end, it's happening on 9 out of 10 attempts...
>
>                  Scan scan = new Scan(Bytes.toBytes(key));
>                  scan.setCaching(1);
>                  scan.setMaxVersions(1);
>                  scan.addFamily(Bytes.toBytes("V1"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));
>
> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Tuesday, September 06, 2011 9:51 AM
> To: Jan Lukavský; user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: ScannerTimeoutException
>
> I'll try your suggestions!
>
> -----Original Message-----
> From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz]
> Sent: Tuesday, September 06, 2011 9:48 AM
> To: user@hbase.apache.org
> Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
> Subject: Re: ScannerTimeoutException
>
> Hi Geoff,
>
> we are having these issues when the scanner uses scan.addColumn() and
> the column is sparse, in the sense that there are many rows containing
> only other columns in the same column family. I suppose your problems
> will vanish if you use the scan.addFamily() call instead. The same
> behavior may appear if you are reading from a region after a massive
> delete (the timeouts settle down after a major compaction), or when
> using server-side Filters.
>
> Changing scan.addColumn() to scan.addFamily() brings some overhead,
> which I think could be removed by the RegionServer renewing the
> scanner's lease while reading data, not only on each entry to
> HRegionServer.next().
>
> Would this be worth opening a JIRA?
>
> Jan
>
> On 6.9.2011 04:11, Geoff Hendrey wrote:
>> Hi -
>>
>>
>>
>> I found some odd behavior with ResultScanner.next(). Usually the times
>> for next() are a couple hundred ms, but occasionally the call to next()
>> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
>> once in a while the call to next() itself is interrupted by a
>> ScannerTimeoutException after more than 60 seconds. It seems odd that
>> the call to next() itself can be interrupted with "61107ms passed since
>> the last invocation, timeout is currently set to 60000".
>>
>> The only thing I can think of is that a GC kicks in after the call to
>> next() begins but before it returns, while the server keeps ticking the
>> timeout clock. But this seems to happen with regularity, and the odds of
>> a GC kicking in at that exact instant, so often, seem low.
>>
>> -geoff
>>
>>
>>




RE: ScannerTimeoutException

Posted by Geoff Hendrey <gh...@decarta.com>.
Hi Jan -

The master logs are almost completely quiet; all that is going on is "skipping load balancing" at the time these timeouts occur. I will check on the answer to "Do the timeouts occur while scanning for the same row(s)?", but I suspect the answer is no.

-geoff

-----Original Message-----
From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz] 
Sent: Thursday, September 08, 2011 8:34 AM
To: Geoff Hendrey
Cc: user@hbase.apache.org
Subject: Re: ScannerTimeoutException

Hi Geoff,

I believe you don't need the scan.addColumn() calls when you add the
whole family, although this should not affect the timeouts. If the
timeouts are getting more frequent, do you see compactions in your
RegionServer's log? Do the timeouts occur while scanning for the same
row(s)?

Jan

On 8.9.2011 17:23, Geoff Hendrey wrote:
> Hi Jan -
>
> The relevant code looks like this. I added addFamily. Still getting the ScannerTimeoutException. Arghhh. This is really becoming a mission-critical problem for my team... Furthermore, the problem becomes more and more frequent, to the point that the job almost never finishes. At the beginning of the job, the timeout never happens. By the end, it's happening on 9 out of 10 attempts...
>
>                  Scan scan = new Scan(Bytes.toBytes(key));
>                  scan.setCaching(1);
>                  scan.setMaxVersions(1);
>                  scan.addFamily(Bytes.toBytes("V1"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));
>
> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Tuesday, September 06, 2011 9:51 AM
> To: Jan Lukavský; user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: ScannerTimeoutException
>
> I'll try your suggestions!
>
> -----Original Message-----
> From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz]
> Sent: Tuesday, September 06, 2011 9:48 AM
> To: user@hbase.apache.org
> Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
> Subject: Re: ScannerTimeoutException
>
> Hi Geoff,
>
> we are having these issues when the scanner uses scan.addColumn() and
> the column is sparse, in the sense that there are many rows containing
> only other columns in the same column family. I suppose your problems
> will vanish if you use the scan.addFamily() call instead. The same
> behavior may appear if you are reading from a region after a massive
> delete (the timeouts settle down after a major compaction), or when
> using server-side Filters.
>
> Changing scan.addColumn() to scan.addFamily() brings some overhead,
> which I think could be removed by the RegionServer renewing the
> scanner's lease while reading data, not only on each entry to
> HRegionServer.next().
>
> Would this be worth opening a JIRA?
>
> Jan
>
> On 6.9.2011 04:11, Geoff Hendrey wrote:
>> Hi -
>>
>>
>>
>> I found some odd behavior with ResultScanner.next(). Usually the times
>> for next() are a couple hundred ms, but occasionally the call to next()
>> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
>> once in a while the call to next() itself is interrupted by a
>> ScannerTimeoutException after more than 60 seconds. It seems odd that
>> the call to next() itself can be interrupted with "61107ms passed since
>> the last invocation, timeout is currently set to 60000".
>>
>> The only thing I can think of is that a GC kicks in after the call to
>> next() begins but before it returns, while the server keeps ticking the
>> timeout clock. But this seems to happen with regularity, and the odds of
>> a GC kicking in at that exact instant, so often, seem low.
>>
>> -geoff
>>
>>
>>


Re: ScannerTimeoutException

Posted by Vidhyashankar Venkataraman <vi...@yahoo-inc.com>.
There are a few reasons why this occurs:

1. The RegionServer is very busy. Splits and compactions together suck precious CPU/IO cycles.
2. GC pauses. See if you can spot any swapping (through iostat/vmstat) or any unusual pauses in the GC logs.
3. There is an existing scanner bug (which HBase version are you using?) wherein the scanner does not send keepalive messages until it sees a row that matches your scan. So the scanner is busy in the RegionServer trying to find a row that matches the scan filter while the client waits for a response.

There could be more reasons. This is what I could come up with off the cuff.
Cheers
V


On 9/8/11 8:34 AM, "Jan Lukavský" <ja...@firma.seznam.cz> wrote:

Hi Geoff,

I believe you don't need the scan.addColumn() calls when you add the
whole family, although this should not affect the timeouts. If the
timeouts are getting more frequent, do you see compactions in your
RegionServer's log? Do the timeouts occur while scanning for the same
row(s)?

Jan

On 8.9.2011 17:23, Geoff Hendrey wrote:
> Hi Jan -
>
> The relevant code looks like this. I added addFamily. Still getting the ScannerTimeoutException. Arghhh. This is really becoming a mission-critical problem for my team... Furthermore, the problem becomes more and more frequent, to the point that the job almost never finishes. At the beginning of the job, the timeout never happens. By the end, it's happening on 9 out of 10 attempts...
>
>                  Scan scan = new Scan(Bytes.toBytes(key));
>                  scan.setCaching(1);
>                  scan.setMaxVersions(1);
>                  scan.addFamily(Bytes.toBytes("V1"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));
>
> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Tuesday, September 06, 2011 9:51 AM
> To: Jan Lukavský; user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: ScannerTimeoutException
>
> I'll try your suggestions!
>
> -----Original Message-----
> From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz]
> Sent: Tuesday, September 06, 2011 9:48 AM
> To: user@hbase.apache.org
> Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
> Subject: Re: ScannerTimeoutException
>
> Hi Geoff,
>
> we are having these issues when the scanner uses scan.addColumn() and
> the column is sparse, in the sense that there are many rows containing
> only other columns in the same column family. I suppose your problems
> will vanish if you use the scan.addFamily() call instead. The same
> behavior may appear if you are reading from a region after a massive
> delete (the timeouts settle down after a major compaction), or when
> using server-side Filters.
>
> Changing scan.addColumn() to scan.addFamily() brings some overhead,
> which I think could be removed by the RegionServer renewing the
> scanner's lease while reading data, not only on each entry to
> HRegionServer.next().
>
> Would this be worth opening a JIRA?
>
> Jan
>
> On 6.9.2011 04:11, Geoff Hendrey wrote:
>> Hi -
>>
>>
>>
>> I found some odd behavior with ResultScanner.next(). Usually the times
>> for next() are a couple hundred ms, but occasionally the call to next()
>> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
>> once in a while the call to next() itself is interrupted by a
>> ScannerTimeoutException after more than 60 seconds. It seems odd that
>> the call to next() itself can be interrupted with "61107ms passed since
>> the last invocation, timeout is currently set to 60000".
>>
>> The only thing I can think of is that a GC kicks in after the call to
>> next() begins but before it returns, while the server keeps ticking the
>> timeout clock. But this seems to happen with regularity, and the odds of
>> a GC kicking in at that exact instant, so often, seem low.
>>
>> -geoff
>>
>>
>>



Re: ScannerTimeoutException

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi Geoff,

I believe you don't need the scan.addColumn() calls when you add the
whole family, although this should not affect the timeouts. If the
timeouts are getting more frequent, do you see compactions in your
RegionServer's log? Do the timeouts occur while scanning for the same
row(s)?

Jan

On 8.9.2011 17:23, Geoff Hendrey wrote:
> Hi Jan -
>
> The relevant code looks like this. I added addFamily. Still getting the ScannerTimeoutException. Arghhh. This is really becoming a mission-critical problem for my team... Furthermore, the problem becomes more and more frequent, to the point that the job almost never finishes. At the beginning of the job, the timeout never happens. By the end, it's happening on 9 out of 10 attempts...
>
>                  Scan scan = new Scan(Bytes.toBytes(key));
>                  scan.setCaching(1);
>                  scan.setMaxVersions(1);
>                  scan.addFamily(Bytes.toBytes("V1"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
>                  scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));
>
> -geoff
>
> -----Original Message-----
> From: Geoff Hendrey
> Sent: Tuesday, September 06, 2011 9:51 AM
> To: Jan Lukavský; user@hbase.apache.org
> Cc: hbase-user@hadoop.apache.org
> Subject: RE: ScannerTimeoutException
>
> I'll try your suggestions!
>
> -----Original Message-----
> From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz]
> Sent: Tuesday, September 06, 2011 9:48 AM
> To: user@hbase.apache.org
> Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
> Subject: Re: ScannerTimeoutException
>
> Hi Geoff,
>
> we are having these issues when the scanner uses scan.addColumn() and
> the column is sparse, in the sense that there are many rows containing
> only other columns in the same column family. I suppose your problems
> will vanish if you use the scan.addFamily() call instead. The same
> behavior may appear if you are reading from a region after a massive
> delete (the timeouts settle down after a major compaction), or when
> using server-side Filters.
>
> Changing scan.addColumn() to scan.addFamily() brings some overhead,
> which I think could be removed by the RegionServer renewing the
> scanner's lease while reading data, not only on each entry to
> HRegionServer.next().
>
> Would this be worth opening a JIRA?
>
> Jan
>
> On 6.9.2011 04:11, Geoff Hendrey wrote:
>> Hi -
>>
>>
>>
>> I found some odd behavior with ResultScanner.next(). Usually the times
>> for next() are a couple hundred ms, but occasionally the call to next()
>> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
>> once in a while the call to next() itself is interrupted by a
>> ScannerTimeoutException after more than 60 seconds. It seems odd that
>> the call to next() itself can be interrupted with "61107ms passed since
>> the last invocation, timeout is currently set to 60000".
>>
>> The only thing I can think of is that a GC kicks in after the call to
>> next() begins but before it returns, while the server keeps ticking the
>> timeout clock. But this seems to happen with regularity, and the odds of
>> a GC kicking in at that exact instant, so often, seem low.
>>
>> -geoff
>>
>>
>>


RE: ScannerTimeoutException

Posted by Geoff Hendrey <gh...@decarta.com>.
Hi Jan -

The relevant code looks like this. I added addFamily. Still getting the ScannerTimeoutException. Arghhh. This is really becoming a mission-critical problem for my team... Furthermore, the problem becomes more and more frequent, to the point that the job almost never finishes. At the beginning of the job, the timeout never happens. By the end, it's happening on 9 out of 10 attempts...

                Scan scan = new Scan(Bytes.toBytes(key));
                scan.setCaching(1);
                scan.setMaxVersions(1);
                scan.addFamily(Bytes.toBytes("V1")); 
                scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("cluster_map"));
                scan.addColumn(Bytes.toBytes("V1"), Bytes.toBytes("version_control_number"));

-geoff
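
One caveat about the snippet above: in the HBase client, calling scan.addColumn() for a family appears to replace the "all columns" marker that scan.addFamily() set for it, so this scan is still restricted to the two named columns and the addFamily() call is effectively undone. If the intent is to scan the whole family, as Jan suggested, the addColumn() calls should go. A sketch, assuming the 0.90-era client behaves as described:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    Scan scan = new Scan(Bytes.toBytes(key)); // key as in the snippet above
    scan.setCaching(1);
    scan.setMaxVersions(1);
    // Request the entire V1 family; adding columns for V1 after this call
    // would narrow the scan back down to just those columns.
    scan.addFamily(Bytes.toBytes("V1"));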

-----Original Message-----
From: Geoff Hendrey 
Sent: Tuesday, September 06, 2011 9:51 AM
To: Jan Lukavský; user@hbase.apache.org
Cc: hbase-user@hadoop.apache.org
Subject: RE: ScannerTimeoutException

I'll try your suggestions!

-----Original Message-----
From: Jan Lukavský [mailto:jan.lukavsky@firma.seznam.cz] 
Sent: Tuesday, September 06, 2011 9:48 AM
To: user@hbase.apache.org
Cc: Geoff Hendrey; hbase-user@hadoop.apache.org
Subject: Re: ScannerTimeoutException

Hi Geoff,

we are having these issues when the scanner uses scan.addColumn() and
the column is sparse, in the sense that there are many rows containing
only other columns in the same column family. I suppose your problems
will vanish if you use the scan.addFamily() call instead. The same
behavior may appear if you are reading from a region after a massive
delete (the timeouts settle down after a major compaction), or when
using server-side Filters.

Changing scan.addColumn() to scan.addFamily() brings some overhead,
which I think could be removed by the RegionServer renewing the
scanner's lease while reading data, not only on each entry to
HRegionServer.next().

Would this be worth opening a JIRA?

Jan

On 6.9.2011 04:11, Geoff Hendrey wrote:
> Hi -
>
>
>
> I found some odd behavior with ResultScanner.next(). Usually the times
> for next() are a couple hundred ms, but occasionally the call to next()
> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
> once in a while the call to next() itself is interrupted by a
> ScannerTimeoutException after more than 60 seconds. It seems odd that
> the call to next() itself can be interrupted with "61107ms passed since
> the last invocation, timeout is currently set to 60000".
>
> The only thing I can think of is that a GC kicks in after the call to
> next() begins but before it returns, while the server keeps ticking the
> timeout clock. But this seems to happen with regularity, and the odds of
> a GC kicking in at that exact instant, so often, seem low.
>
> -geoff
>
>
>


Re: ScannerTimeoutException

Posted by Jan Lukavský <ja...@firma.seznam.cz>.
Hi Geoff,

we are having these issues when the scanner uses scan.addColumn() and
the column is sparse, in the sense that there are many rows containing
only other columns in the same column family. I suppose your problems
will vanish if you use the scan.addFamily() call instead. The same
behavior may appear if you are reading from a region after a massive
delete (the timeouts settle down after a major compaction), or when
using server-side Filters.

Changing scan.addColumn() to scan.addFamily() brings some overhead,
which I think could be removed by the RegionServer renewing the
scanner's lease while reading data, not only on each entry to
HRegionServer.next().

Would this be worth opening a JIRA?

Jan

On 6.9.2011 04:11, Geoff Hendrey wrote:
> Hi -
>
>
>
> I found some odd behavior with ResultScanner.next(). Usually the times
> for next() are a couple hundred ms, but occasionally the call to next()
> spikes VERY long. I have the timeout set to 60 seconds (60000 ms), yet
> once in a while the call to next() itself is interrupted by a
> ScannerTimeoutException after more than 60 seconds. It seems odd that
> the call to next() itself can be interrupted with "61107ms passed since
> the last invocation, timeout is currently set to 60000".
>
> The only thing I can think of is that a GC kicks in after the call to
> next() begins but before it returns, while the server keeps ticking the
> timeout clock. But this seems to happen with regularity, and the odds of
> a GC kicking in at that exact instant, so often, seem low.
>
> -geoff
>
>
>