Posted to dev@hbase.apache.org by Varun Sharma <va...@pinterest.com> on 2013/05/15 20:32:49 UTC

Where is scanner startRow used

Hi,

Could someone please point me to where Scan.startRow is being used?

From what I can see in HRegion.RegionScannerImpl, it is unused. A grep does
not seem to return any valid entries. But my knowledge of this part is
limited.

We are debugging poor performance on prefix scans in tall schemas. If this
is really an issue, I will open a JIRA...

Varun

Re: Where is scanner startRow used

Posted by ramkrishna vasudevan <ra...@gmail.com>.

Could this be a regression (though I don't think it is)? Do you observe
this behaviour only in recent versions, or was it like that in all the
versions you used earlier, which made you switch to wide schemas?

Regards
Ram


On Thu, May 16, 2013 at 2:27 AM, Varun Sharma <va...@pinterest.com> wrote:

> On Wed, May 15, 2013 at 1:20 PM, lars hofhansl <la...@apache.org> wrote:
>
> > Do you have some more details?
> >
> Yes,  the rows have 50 columns each when we use a wide schema.
> Unfortunately, this was a while back when we tried to go tall and found
> performance to be poor and eventually switched to wide. The reason why I
> say "unfortunately" is because I don't remember the exact performance
> numbers. Now we have a use case where we may have much wider rows (millions
> of columns) - so because of these outliars, we prefer tall. I probably
> should try reproducing the same test case again. We basically saw
> significantly more iowait and I/O with the tall schema v/s get schema as we
> upp'ed the load.
>
>
> > Why would a scan in a tall schema be all over the place but in a wide
> > schema it is not?
> >
> It is random in both cases - the scans are as random as the gets. Probably
> a mistake in my email below.
>
> > How wide were the rows before? About 50 columns?
> >
> Yes 50 columns or so (could be upto 100 but not much).
>
> >
> >
> > -- Lars
> >
> >
> > ----- Original Message -----
> > From: Varun Sharma <va...@pinterest.com>
> > To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> > Cc:
> > Sent: Wednesday, May 15, 2013 11:58 AM
> > Subject: Re: Where is scanner startRow used
> >
> > Yeah i just checked that we were already using startRow and its still
> > significantly poorer performance than the wide schema (close to unusable)
> >
> > We are doing scans of 50 batch size but the scans are all over the place -
> > very random because the schema is tall and not wide. I have created a JIRA
> > for the same and I will report performance numbers there. But to me, not
> > seeking to the start row within a region feels clearly suboptimal.
> >
> > Thanks
> > Varun
> >
> >
> > On Wed, May 15, 2013 at 11:48 AM, Anoop John <an...@gmail.com>
> > wrote:
> >
> > > At client side see ScannerCallable where this is passed to
> > > ServerCallable..  Based on this only which regions should be 1st scanned is
> > > decided..
> > > I think some time back also the prefix filter was discussed. At that time
> > > also the conclusion was to use the start row. U can set a start row now
> > > right?  Pls check the perf with this once.
> > >
> > > -Anoop-
> > >
> > >
> > > On Thu, May 16, 2013 at 12:02 AM, Varun Sharma <va...@pinterest.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Could someone please point me to where Scan.startRow is being used ?
> > > >
> > > > From what I can see in HRegion.RegionScannerImpl, it is unused. A grep does
> > > > not seem to return any valid entries. But my knowledge of this part is
> > > > limited.
> > > >
> > > > We are debugging poor performance on prefix scans in tall schemas. If this
> > > > is really an issue, I will open a JIRA...
> > > >
> > > > Varun
> > > >
> > >
> >
> >
>

Re: Where is scanner startRow used

Posted by Varun Sharma <va...@pinterest.com>.

On Wed, May 15, 2013 at 1:20 PM, lars hofhansl <la...@apache.org> wrote:

> Do you have some more details?
>
Yes, the rows have 50 columns each when we use a wide schema.
Unfortunately, this was a while back, when we tried to go tall, found
performance to be poor and eventually switched to wide. The reason I
say "unfortunately" is that I don't remember the exact performance
numbers. Now we have a use case where we may have much wider rows (millions
of columns) - so because of these outliers, we prefer tall. I probably
should try reproducing the same test case again. We basically saw
significantly more iowait and I/O with scans on the tall schema than with
gets on the wide schema as we increased the load.
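
For readers unfamiliar with the two layouts, here is a minimal sketch in
plain Java (no HBase client involved). It assumes a hypothetical composite
key of the form "entity|attribute" for the tall layout; the separator, the
class and the method names are illustrative only and are not taken from our
actual schema. It shows that a get on a wide row and a prefix scan on the
tall layout can return the same data.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class TallVsWide {
    // Hypothetical separator between entity id and attribute in the tall key.
    static final char SEP = '|';

    /** Wide layout: one row per entity, one column per attribute (a "get"). */
    static NavigableMap<String, String> wideGet(
            NavigableMap<String, NavigableMap<String, String>> wide, String entity) {
        NavigableMap<String, String> row = wide.get(entity);
        return row == null ? new TreeMap<String, String>() : row;
    }

    /** Tall layout: one row per (entity, attribute), keyed "entity|attribute". */
    static NavigableMap<String, String> toTall(
            NavigableMap<String, NavigableMap<String, String>> wide) {
        NavigableMap<String, String> tall = new TreeMap<String, String>();
        for (Map.Entry<String, NavigableMap<String, String>> row : wide.entrySet()) {
            for (Map.Entry<String, String> col : row.getValue().entrySet()) {
                tall.put(row.getKey() + SEP + col.getKey(), col.getValue());
            }
        }
        return tall;
    }

    /** Prefix scan over the tall layout: ["entity|", "entity}") in key order. */
    static NavigableMap<String, String> tallPrefixScan(
            NavigableMap<String, String> tall, String entity) {
        String start = entity + SEP;
        String stop = entity + (char) (SEP + 1); // first key past the prefix
        NavigableMap<String, String> out = new TreeMap<String, String>();
        for (Map.Entry<String, String> e
                : tall.subMap(start, true, stop, false).entrySet()) {
            // Strip the "entity|" prefix so the result looks like a wide row.
            out.put(e.getKey().substring(start.length()), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<String, NavigableMap<String, String>> wide =
                new TreeMap<String, NavigableMap<String, String>>();
        NavigableMap<String, String> row = new TreeMap<String, String>();
        row.put("c1", "v1");
        row.put("c2", "v2");
        wide.put("user42", row);

        NavigableMap<String, String> tall = toTall(wide);
        System.out.println(wideGet(wide, "user42"));
        System.out.println(tallPrefixScan(tall, "user42"));
    }
}
```

The sketch only models key ordering; it says nothing about the I/O cost of
either access path, which is what we actually measured.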


> Why would a scan in a tall schema be all over the place but in a wide
> schema it is not?
>
It is random in both cases - the scans are as random as the gets. Probably
a mistake in my email below.

> How wide were the rows before? About 50 columns?
>
Yes, 50 columns or so (could be up to 100, but not much more).

>
>
> -- Lars
>
>
> ----- Original Message -----
> From: Varun Sharma <va...@pinterest.com>
> To: "dev@hbase.apache.org" <de...@hbase.apache.org>
> Cc:
> Sent: Wednesday, May 15, 2013 11:58 AM
> Subject: Re: Where is scanner startRow used
>
> Yeah i just checked that we were already using startRow and its still
> significantly poorer performance than the wide schema (close to unusable)
>
> We are doing scans of 50 batch size but the scans are all over the place -
> very random because the schema is tall and not wide. I have created a JIRA
> for the same and I will report performance numbers there. But to me, not
> seeking to the start row within a region feels clearly suboptimal.
>
> Thanks
> Varun
>
>
> On Wed, May 15, 2013 at 11:48 AM, Anoop John <an...@gmail.com>
> wrote:
>
> > At client side see ScannerCallable where this is passed to
> > ServerCallable..  Based on this only which regions should be 1st scanned is
> > decided..
> > I think some time back also the prefix filter was discussed. At that time
> > also the conclusion was to use the start row. U can set a start row now
> > right?  Pls check the perf with this once.
> >
> > -Anoop-
> >
> >
> > On Thu, May 16, 2013 at 12:02 AM, Varun Sharma <va...@pinterest.com>
> > wrote:
> >
> > > Hi,
> > >
> > > Could someone please point me to where Scan.startRow is being used ?
> > >
> > > From what I can see in HRegion.RegionScannerImpl, it is unused. A grep does
> > > not seem to return any valid entries. But my knowledge of this part is
> > > limited.
> > >
> > > We are debugging poor performance on prefix scans in tall schemas. If this
> > > is really an issue, I will open a JIRA...
> > >
> > > Varun
> > >
> >
>
>

Re: Where is scanner startRow used

Posted by lars hofhansl <la...@apache.org>.

Do you have some more details?
Why would a scan in a tall schema be all over the place but in a wide schema it is not?
How wide were the rows before? About 50 columns?


-- Lars


----- Original Message -----
From: Varun Sharma <va...@pinterest.com>
To: "dev@hbase.apache.org" <de...@hbase.apache.org>
Cc: 
Sent: Wednesday, May 15, 2013 11:58 AM
Subject: Re: Where is scanner startRow used

Yeah i just checked that we were already using startRow and its still
significantly poorer performance than the wide schema (close to unusable)

We are doing scans of 50 batch size but the scans are all over the place -
very random because the schema is tall and not wide. I have created a JIRA
for the same and I will report performance numbers there. But to me, not
seeking to the start row within a region feels clearly suboptimal.

Thanks
Varun


On Wed, May 15, 2013 at 11:48 AM, Anoop John <an...@gmail.com> wrote:

> At client side see ScannerCallable where this is passed to
> ServerCallable..  Based on this only which regions should be 1st scanned is
> decided..
> I think some time back also the prefix filter was discussed. At that time
> also the conclusion was to use the start row. U can set a start row now
> right?  Pls check the perf with this once.
>
> -Anoop-
>
>
> On Thu, May 16, 2013 at 12:02 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > Hi,
> >
> > Could someone please point me to where Scan.startRow is being used ?
> >
> > From what I can see in HRegion.RegionScannerImpl, it is unused. A grep does
> > not seem to return any valid entries. But my knowledge of this part is
> > limited.
> >
> > We are debugging poor performance on prefix scans in tall schemas. If this
> > is really an issue, I will open a JIRA...
> >
> > Varun
> >
>

Re: Where is scanner startRow used

Posted by Varun Sharma <va...@pinterest.com>.

Yeah, I just checked that we were already using startRow, and performance
is still significantly poorer than with the wide schema (close to unusable).

We are doing scans with a batch size of 50, but the scans are all over the
place - very random, because the schema is tall and not wide. I have created
a JIRA for this and I will report performance numbers there. But to me, not
seeking to the start row within a region feels clearly suboptimal.
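
To make the concern about seeking concrete, here is a minimal sketch in
plain Java. It is not HBase code and makes no claim about what HRegion
actually does; the class and method names are hypothetical. It contrasts
walking a sorted key list from the beginning with jumping directly to the
first key at or after the start row, assuming unique keys.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class SeekVsWalk {
    /** Walks from the first key; returns how many keys precede startRow. */
    static int walkToStart(List<String> sortedKeys, String startRow) {
        int examined = 0;
        for (String key : sortedKeys) {
            if (key.compareTo(startRow) >= 0) {
                break;
            }
            examined++;
        }
        return examined; // also the index of the first key >= startRow
    }

    /** Seeks with binary search; returns the index of the first key >= startRow. */
    static int seekToStart(List<String> sortedKeys, String startRow) {
        int idx = Collections.binarySearch(sortedKeys, startRow);
        // A negative result encodes the insertion point as -(point) - 1.
        return idx >= 0 ? idx : -(idx + 1);
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<String>();
        for (int i = 0; i < 1000; i++) {
            keys.add(String.format("row%04d", i));
        }
        // Both calls land on the same position; the walk touches every key
        // before it, while the seek needs only a logarithmic number of compares.
        System.out.println(walkToStart(keys, "row0900"));
        System.out.println(seekToStart(keys, "row0900"));
    }
}
```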

Thanks
Varun


On Wed, May 15, 2013 at 11:48 AM, Anoop John <an...@gmail.com> wrote:

> At client side see ScannerCallable where this is passed to
> ServerCallable..  Based on this only which regions should be 1st scanned is
> decided..
> I think some time back also the prefix filter was discussed. At that time
> also the conclusion was to use the start row. U can set a start row now
> right?  Pls check the perf with this once.
>
> -Anoop-
>
>
> On Thu, May 16, 2013 at 12:02 AM, Varun Sharma <va...@pinterest.com>
> wrote:
>
> > Hi,
> >
> > Could someone please point me to where Scan.startRow is being used ?
> >
> > From what I can see in HRegion.RegionScannerImpl, it is unused. A grep does
> > not seem to return any valid entries. But my knowledge of this part is
> > limited.
> >
> > We are debugging poor performance on prefix scans in tall schemas. If this
> > is really an issue, I will open a JIRA...
> >
> > Varun
> >
>

Re: Where is scanner startRow used

Posted by Anoop John <an...@gmail.com>.

On the client side, see ScannerCallable, where this is passed to
ServerCallable. Based on this alone, it is decided which region should be
scanned first.
I think the prefix filter was also discussed some time back. At that time
too, the conclusion was to use the start row. You can set a start row now,
right? Please check the performance with this once.

-Anoop-
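
The two points above can be sketched in plain Java, without the HBase
client. This is only an illustration under stated assumptions: the class,
the method names and the region map are hypothetical, and the code does not
reproduce what ScannerCallable does internally. It derives a start/stop row
pair from a prefix (stop row = prefix with its last non-0xFF byte
incremented) and picks the first region as the one with the greatest start
key at or below the start row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

final class PrefixScanBounds {
    /** Unsigned lexicographic order, the order in which row keys sort. */
    static int compareRows(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xff) - (b[i] & 0xff);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    }

    static final Comparator<byte[]> ROW_ORDER = PrefixScanBounds::compareRows;

    /**
     * Smallest key greater than every key that starts with the prefix.
     * An empty result means "no upper bound" (the prefix is all 0xFF bytes).
     */
    static byte[] stopRowForPrefix(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xff) {
            end--; // trailing 0xFF bytes cannot be incremented, drop them
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    /** Region whose start key is the greatest one <= startRow, or null. */
    static String firstRegionFor(NavigableMap<byte[], String> regionsByStartKey,
                                 byte[] startRow) {
        Map.Entry<byte[], String> e = regionsByStartKey.floorEntry(startRow);
        return e == null ? null : e.getValue();
    }

    public static void main(String[] args) {
        // Hypothetical table with three regions, keyed by region start key.
        NavigableMap<byte[], String> regions = new TreeMap<byte[], String>(ROW_ORDER);
        regions.put(new byte[0], "region-1");
        regions.put("m".getBytes(StandardCharsets.UTF_8), "region-2");
        regions.put("t".getBytes(StandardCharsets.UTF_8), "region-3");

        byte[] start = "user42|".getBytes(StandardCharsets.UTF_8);
        byte[] stop = stopRowForPrefix(start);
        System.out.println(new String(stop, StandardCharsets.UTF_8)); // user42}
        System.out.println(firstRegionFor(regions, start));           // region-3
    }
}
```

With the real client, the resulting pair would be handed to the Scan as its
start and stop rows instead of relying on a prefix filter alone.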

On Thu, May 16, 2013 at 12:02 AM, Varun Sharma <va...@pinterest.com> wrote:

> Hi,
>
> Could someone please point me to where Scan.startRow is being used ?
>
> From what I can see in HRegion.RegionScannerImpl, it is unused. A grep does
> not seem to return any valid entries. But my knowledge of this part is
> limited.
>
> We are debugging poor performance on prefix scans in tall schemas. If this
> is really an issue, I will open a JIRA...
>
> Varun
>