Posted to user@hbase.apache.org by Ferdy Galema <fe...@kalooga.com> on 2012/07/25 14:04:13 UTC

silently aborted scans when using hbase.client.scanner.max.result.size

I was experiencing aborted scans under certain conditions. In these cases
so many rows were missing that only a fraction of them was read in,
without any warning. After lots of testing I was able to pinpoint and
reproduce the error when scanning over a single region, single column
family, single store file. So really just a single (major_compacted)
storefile. I scan over this region using a single Scan in a local
jobtracker context. (So not MapReduce, although that shows exactly the
same behaviour.) Finally, I noticed that the number of input rows depends
on the hbase.client.scanner.caching property. See the following example
runs, which scan over this region with a specific start and stop key:

-Dhbase.client.scanner.caching=1
inputrows=1506

-Dhbase.client.scanner.caching=10000
inputrows=1240

-Dhbase.client.scanner.caching=1240
inputrows=1506

-Dhbase.client.scanner.caching=1241
inputrows=1240

This is weird, huh? So setting the caching to 1241 in this case aborts the
scan silently. Removing the stoprow yields the same amount. Setting the
caching to 1 with no stoprow yields all rows (several hundred thousand).
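
For reference, the counting itself is nothing special. Here is a minimal
sketch of the kind of scan loop behind these numbers (the table name,
column family and keys are placeholders, not the real ones):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class CountRows {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // the -D flags from the runs above end up in this configuration,
    // e.g. conf.setInt("hbase.client.scanner.caching", 1241);
    HTable table = new HTable(conf, "mytable"); // placeholder table name
    Scan scan = new Scan(Bytes.toBytes("startkey"), Bytes.toBytes("stopkey"));
    scan.addFamily(Bytes.toBytes("cf")); // placeholder column family
    ResultScanner scanner = table.getScanner(scan);
    long inputrows = 0;
    try {
      for (Result r : scanner) {
        inputrows++;
      }
    } finally {
      scanner.close();
      table.close();
    }
    System.out.println("inputrows=" + inputrows);
  }
}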

Neither the client nor the regionserver logs any warning whatsoever. I had
hbase.client.scanner.max.result.size set to 90100100. After removing
this property it all works like a charm!!! All rows are properly read in,
regardless of hbase.client.scanner.caching. As an extra verification I
checked the regionserver for the warnings I would expect once this
property is removed, and indeed they are there:
2012-07-25 11:46:52,889 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 8 on 60020, responseTooLarge for: next(-1937592840574159040, 10000) from x.x.x.x:39398: Size: 338.1m
2012-07-25 11:47:14,359 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 9 on 60020, responseTooLarge for: next(-1937592840574159040, 10000) from x.x.x.x:39407: Size: 186.6m

So, does anyone know what this could be? I am willing to debug this
behaviour at the regionserver level, but before I do I want to make sure
I am not running into something that has already been solved. This is on
hbase-0.90.6-cdh3u4, using snappy.

Re: silently aborted scans when using hbase.client.scanner.max.result.size

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Damn! Well, that's a big bug then, but it seems that HBASE-2214 would
fix it, since the client would pass its own max size? Although, reading
the patch, it doesn't seem so: if it wasn't configured on the client and
it wasn't passed on the Scan, then the region server will pick up its
own configured value anyway.

In the patch:

-      this.maxScannerResultSize = conf.getLong(
+      if (scan.getMaxResultSize() > 0) {
+        this.maxScannerResultSize = scan.getMaxResultSize();
+      } else {
+        this.maxScannerResultSize = conf.getLong(
           HConstants.HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE_KEY,
           HConstants.DEFAULT_HBASE_CLIENT_SCANNER_MAX_RESULT_SIZE);
+      }

If in the else clause you set the new value back on the Scan, then the
region server would always receive the right max result size. Then you
have to wonder why the region server would even set its own, since that's
just likely to cause trouble. Or maybe it's the client that shouldn't
care.
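
With that patch applied, the client-side fix would presumably be as
simple as shipping the limit on the Scan itself. A sketch, assuming a
setMaxResultSize() setter matching the getMaxResultSize() accessor in the
hunk above (so it needs a build that carries the HBASE-2214 patch):

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanWithLimit {
  static Scan buildScan() {
    // placeholder keys; the point is that the limit travels with the scan
    Scan scan = new Scan(Bytes.toBytes("startkey"), Bytes.toBytes("stopkey"));
    scan.setCaching(10000);
    scan.setMaxResultSize(90100100L); // assumed setter from the patch
    return scan;
  }
}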

I'll add a comment to that jira too.

J-D


Re: silently aborted scans when using hbase.client.scanner.max.result.size

Posted by Ferdy Galema <fe...@kalooga.com>.
Thanks man!! It is really that simple! That is crazy. I've been running
this property serverside-only for such a long time, but I never really
experienced the effects until using a higher caching value. (Which is
perfectly explainable.) Wherever this property is mentioned, it surely
must be documented that it is critical to set it on both the server and
the client. (Unless you enjoy missing rows at random.)
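
For anyone who finds this thread later, the client-side half is just a
matter of giving the client the same value the region servers already
run with, either in the client's hbase-site.xml or programmatically. A
minimal sketch with the value from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ClientSideConfig {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // must match what the region servers run with; otherwise the server
    // cuts batches short and the client has no way to tell
    conf.setLong("hbase.client.scanner.max.result.size", 90100100L);
    System.out.println(conf.getLong("hbase.client.scanner.max.result.size", -1));
  }
}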

Thanks again.
Ferdy


Re: silently aborted scans when using hbase.client.scanner.max.result.size

Posted by Jean-Daniel Cryans <jd...@apache.org>.
That looks nasty.

Could it be that your client doesn't know about the max result size?
Looking at ClientScanner.next() we iterate while this is true:

} while (remainingResultSize > 0 && countdown > 0 &&
    nextScanner(countdown, values == null));

Let's say the region server returns fewer rows than requested, like
1240, but the caching is set to 1241. The remaining size would still be
higher than zero and so would the countdown (its value would be 1). So
it's going to try to get the next scanner. If you have just one region,
it would stop there.
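
To make the arithmetic concrete, here is a tiny standalone simulation of
that accounting, using the numbers from this thread (this is not the
real ClientScanner code, just the bookkeeping it does):

public class ScanAccounting {
  public static void main(String[] args) {
    long remainingResultSize = Long.MAX_VALUE; // client never set max.result.size
    int countdown = 1241;        // hbase.client.scanner.caching
    int rowsFromServer = 1240;   // server stopped early at ITS max.result.size
    boolean moreRegions = false; // single-region scan, nothing left to open

    countdown -= rowsFromServer; // 1241 - 1240 = 1
    boolean continueScan = remainingResultSize > 0 && countdown > 0 && moreRegions;
    System.out.println("countdown=" + countdown + " continue=" + continueScan);
    // prints countdown=1 continue=false: the loop exits even though the
    // current region still holds unread rows, and they are silently lost
  }
}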

But that would be the case if you have 1 region and did not set the
config on the client-side.

J-D
