Posted to user@hbase.apache.org by Peter Wolf <op...@gmail.com> on 2012/01/25 13:56:02 UTC
Speeding up Scans
Hello all,
I am looking for advice on speeding up my Scanning.
I want to iterate over all rows where a particular column (language)
equals a particular value ("JA").
I am already creating my row keys using that column in the first bytes.
And I do my scans using partial row matching, like this...
public static byte[] calculateStartRowKey(String language) {
    int languageHash = language.length() > 0 ? language.hashCode() : 0;
    byte[] language2 = Bytes.toBytes(languageHash);
    byte[] accountID2 = Bytes.toBytes(0);
    byte[] timestamp2 = Bytes.toBytes(0);
    return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
}

public static byte[] calculateEndRowKey(String language) {
    int languageHash = language.length() > 0 ? language.hashCode() : 0;
    byte[] language2 = Bytes.toBytes(languageHash + 1);
    byte[] accountID2 = Bytes.toBytes(0);
    byte[] timestamp2 = Bytes.toBytes(0);
    return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
}
Scan scan = new Scan(calculateStartRowKey(language),
calculateEndRowKey(language));
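For readers without HBase on the classpath, the key layout these helpers produce can be sketched with java.nio (assuming, as in HBase's Bytes utility, big-endian int encoding and plain concatenation):

```java
import java.nio.ByteBuffer;

// Sketch of the 12-byte key the helpers above build: three 4-byte
// big-endian ints laid out as [languageHash | accountID | timestamp].
public class RowKeyLayoutDemo {
    static byte[] startKey(String language) {
        int languageHash = language.length() > 0 ? language.hashCode() : 0;
        return ByteBuffer.allocate(12)
                .putInt(languageHash)  // bytes 0-3: hash of the language code
                .putInt(0)             // bytes 4-7: minimum accountID
                .putInt(0)             // bytes 8-11: minimum timestamp
                .array();
    }

    public static void main(String[] args) {
        byte[] key = startKey("JA");
        System.out.println(key.length);  // 12
    }
}
```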
Since I am using a hash of the string, I need to re-check the column to
make sure that some other string did not produce the same hash value:
Filter filter = new SingleColumnValueFilter(resultFamily, languageCol,
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
scan.setFilter(filter);
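The re-check is genuinely necessary, since Java's String.hashCode() collides easily; a self-contained illustration:

```java
// Distinct strings can share an int hash, so a key prefix built from
// hashCode() alone cannot distinguish them -- hence the value filter.
// "Aa" and "BB" are a classic String.hashCode() collision.
public class HashCollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112
        System.out.println("Aa".equals("BB")); // false
    }
}
```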
I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on EC2.
I think that this should be really fast, but it is not. Any advice on
how to debug/speed it up?
Thanks
Peter
Re: Speeding up Scans
Posted by Michael Segel <mi...@hotmail.com>.
I'm confused...
You mention that you are hashing your key, and you want to do a scan with a start and stop value?
Could you elaborate?
With respect to hashing, if you use a SHA-1 hash, your values will be unique.
(you talked about rehashing ...)
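A minimal sketch of the SHA-1 approach in plain Java (illustrative only; note the tradeoff that hashed prefixes are unordered, so start/stop-row range scans only work when the whole prefix is a single exact hash value, as in the scheme above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: derive a fixed-width, collision-resistant key prefix with SHA-1.
public class Sha1PrefixDemo {
    static byte[] sha1Prefix(String s) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        return md.digest(s.getBytes(StandardCharsets.UTF_8)); // 20-byte digest
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha1Prefix("JA").length); // 20
    }
}
```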
Sent from my iPhone
Re: Speeding up Scans
Posted by Jean-Daniel Cryans <jd...@apache.org>.
If you're running a full scan (what PE scan does) on a table that
doesn't fit in the block cache, setting setCacheBlocks(true) is the
last thing you want to do (unless you fancy getting massive cache
churn).
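The churn effect can be modeled with a toy LRU in plain Java (illustrative only; the real block cache is more sophisticated):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the churn described above: an access-ordered LRU standing in
// for the block cache. Sequentially scanning more blocks than fit in the
// cache yields zero hits -- every block is evicted before it is revisited.
public class CacheChurnDemo {
    static int scanHits(final int cacheCapacity, int totalBlocks, int passes) {
        Map<Integer, Boolean> cache =
            new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                    return size() > cacheCapacity; // evict LRU past capacity
                }
            };
        int hits = 0;
        for (int p = 0; p < passes; p++) {
            for (int b = 0; b < totalBlocks; b++) {
                if (cache.containsKey(b)) hits++;
                cache.put(b, Boolean.TRUE);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two full "table scans" of 1000 blocks through a 100-block cache:
        System.out.println(scanHits(100, 1000, 2));  // 0 -- pure churn
        // The same scans when the table fits in the cache:
        System.out.println(scanHits(1000, 1000, 2)); // 1000 -- 2nd pass all hits
    }
}
```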
33k does sound awfully low.
J-D
On Thu, Jan 26, 2012 at 6:54 AM, Tim Robertson
<ti...@gmail.com> wrote:
Re: Speeding up Scans
Posted by Tim Robertson <ti...@gmail.com>.
Hey Peter,
I am trying to benchmark our 3 node cluster now and trying to optimize
for scanning.
Using the PerformanceEvaluation tool I did a random write to populate
5M rows (I believe they are 1k each but whatever the tool does by
default).
I am seeing 33k records per second (which I believe to be too low)
with the following.
scan.setCacheBlocks(true);
scan.setCaching(10000);
It might be worth using the PE
(http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation) tool to
load, as then you are using a known table and content to compare
against.
I am running a 3-node cluster (2x quad core, 6x 250 GB SATA, 24 GB mem with 6 GB on the RS).
HTH,
Tim
On Thu, Jan 26, 2012 at 3:39 PM, Peter Wolf <op...@gmail.com> wrote:
Re: Speeding up Scans
Posted by Peter Wolf <op...@gmail.com>.
Thank you Doug and Geoff,
After following your advice I am now up to about 100 rows a second. Is
that considered fast for HBase?
My data is not big; I only have hundreds of thousands of rows in my table
at the moment.
Do I still have a tuning problem? How fast should I expect?
Thanks
Peter
On 1/25/12 2:32 PM, Doug Meil wrote:
Re: Speeding up Scans
Posted by Peter Wolf <op...@gmail.com>.
Interesting,
I added this, and my scan did speed up somewhat
conf.setInt("hbase.client.prefetch.limit", 100);
hTable = new HTable(conf, tableName);
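If this is the knob that matters, the same property can presumably also be set in the client's hbase-site.xml instead of in code (property name as used above; the value 100 is just the one tried here):

```xml
<!-- hbase-site.xml (client side): how many region locations to prefetch -->
<property>
  <name>hbase.client.prefetch.limit</name>
  <value>100</value>
</property>
```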
What does this configuration property really control, and how should it be
set to an appropriate value? What is a region, and how does it map to
rows, families, and columns? What are the tradeoffs for making it big?
Peter
Re: Speeding up Scans
Posted by Doug Meil <do...@explorysmedical.com>.
Thanks Geoff! No apology required, that's good stuff. I'll update the
book with that param.
On 1/25/12 2:17 PM, "Geoff Hendrey" <gh...@decarta.com> wrote:
RE: Speeding up Scans
Posted by Geoff Hendrey <gh...@decarta.com>.
Sorry for jumping in late, and perhaps out of context, but I'm pasting
in some findings (reported to this list by us a while back) that helped
us get scans to perform very fast. Adjusting
hbase.client.prefetch.limit was critical for us:
========================
It's even more mysterious than we think. There is a lack of documentation
(or perhaps a lack of know-how). Apparently there are two factors that
decide the performance of a scan.
1. Scanner caching, as we know. We always had scanner caching set to
1, but this is different from the prefetch limit.
2. hbase.client.prefetch.limit. This is the meta-caching limit; it
defaults to 10, so the client prefetches 10 region locations every time we
scan a region that has not already been pre-warmed.
The "hbase.client.prefetch.limit" value is passed along to the client code
to prefetch the next 10 region locations:
int rows = Math.min(rowLimit,
configuration.getInt("hbase.meta.scanner.caching", 100));
The "rows" variable takes the minimum, 10, and always prefetches at most 10
region boundaries. Hence every new region boundary that has not already
been pre-warmed fetches the next 10 region locations, resulting in a first
slow query followed by quick responses. This is basically pre-warming the
meta cache, not the region cache.
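A back-of-the-envelope model of that behavior (illustrative arithmetic only, not actual HBase client code):

```java
// Sketch: with a prefetch limit of 10, a scan crossing R region boundaries
// pays roughly ceil(R / 10) "slow" meta lookups; the rest hit the warmed
// location cache.
public class MetaPrefetchDemo {
    static int slowLookups(int regions, int prefetchLimit) {
        return (regions + prefetchLimit - 1) / prefetchLimit; // ceil division
    }

    public static void main(String[] args) {
        System.out.println(slowLookups(100, 10));  // 10 slow meta fetches
        System.out.println(slowLookups(100, 100)); // 1 slow meta fetch
    }
}
```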
-----Original Message-----
From: Jeff Whiting [mailto:jeffw@qualtrics.com]
Sent: Wednesday, January 25, 2012 10:09 AM
To: user@hbase.apache.org
Subject: Re: Speeding up Scans
Does it make sense to have better defaults so the performance out of the
box is better?
~Jeff
On 1/25/2012 8:06 AM, Peter Wolf wrote:
> Ah ha! I appear to be insane ;-)
>
> Adding the following speeded things up quite a bit
>
> scan.setCacheBlocks(true);
> scan.setCaching(1000);
>
> Thank you, it was a duh!
>
> P
>
>
>
> On 1/25/12 8:13 AM, Doug Meil wrote:
>> Hi there-
>>
>> Quick sanity check: what caching level are you using? (default is 1) I
>> know this is basic, but it's always good to double-check.
>>
>> If "language" is already in the lead position of the rowkey, why use
the
>> filter?
>>
>> As for EC2, that's a wildcard.
--
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com
Re: Speeding up Scans
Posted by Doug Meil <do...@explorysmedical.com>.
I think this is one of those "damned if you do..." situations. If you
want to do a lot of quick single-record lookups (a Get is actually a Scan
underneath the covers), then "1" is what you want. But for MapReduce
jobs, or for scans over a large number of records like you're doing,
you'll want the value higher.
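The tradeoff is easy to quantify: with scanner caching set to c, a scan
that returns n rows costs roughly ceil(n / c) RPC round trips. A
back-of-the-envelope sketch in plain Java (the numbers are illustrative,
not from this thread):

```java
// Rough cost model for scanner caching: each next() RPC returns up to
// `caching` rows, so a scan of `rows` rows needs ceil(rows / caching)
// round trips to the region servers.
public class RpcEstimate {
    static long rpcRoundTrips(long rows, int caching) {
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(rpcRoundTrips(100_000, 1));    // default: one RPC per row
        System.out.println(rpcRoundTrips(100_000, 1000)); // Peter's setting
    }
}
```

The flip side is memory: each round trip buffers up to `caching` rows on
both the region server and the client, which is why 1 is a safe default
for single-record lookups but painful for wide scans.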
On 1/25/12 1:09 PM, "Jeff Whiting" <je...@qualtrics.com> wrote:
>Does it make sense to have better defaults so the performance out of the
>box is better?
>
>~Jeff
Re: Speeding up Scans
Posted by Jeff Whiting <je...@qualtrics.com>.
Does it make sense to have better defaults so the performance out of the box is better?
~Jeff
On 1/25/2012 8:06 AM, Peter Wolf wrote:
> Ah ha! I appear to be insane ;-)
>
> Adding the following speeded things up quite a bit
>
> scan.setCacheBlocks(true);
> scan.setCaching(1000);
>
> Thank you, it was a duh!
>
> P
--
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com
Re: Speeding up Scans
Posted by Doug Meil <do...@explorysmedical.com>.
No problem! That's one of the tips in the Performance chapter of the
book/refGuide - always a good thing to double-check because even the most
experienced folks sometimes forget the simple stuff.
On 1/25/12 10:06 AM, "Peter Wolf" <op...@gmail.com> wrote:
>Ah ha! I appear to be insane ;-)
>
>Adding the following speeded things up quite a bit
>
> scan.setCacheBlocks(true);
> scan.setCaching(1000);
>
>Thank you, it was a duh!
>
>P
Re: Speeding up Scans
Posted by Peter Wolf <op...@gmail.com>.
Ah ha! I appear to be insane ;-)
Adding the following speeded things up quite a bit
scan.setCacheBlocks(true);
scan.setCaching(1000);
Thank you, it was a duh!
P
On 1/25/12 8:13 AM, Doug Meil wrote:
> Hi there-
>
> Quick sanity check: what caching level are you using? (default is 1) I
> know this is basic, but it's always good to double-check.
>
> If "language" is already in the lead position of the rowkey, why use the
> filter?
>
> As for EC2, that's a wildcard.
>
>
>
>
>
Re: Speeding up Scans
Posted by Doug Meil <do...@explorysmedical.com>.
Hi there-
Quick sanity check: what caching level are you using? (default is 1) I
know this is basic, but it's always good to double-check.
If "language" is already in the lead position of the rowkey, why use the
filter?
As for EC2, that's a wildcard.
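One caveat to dropping the filter: the key stores language.hashCode()
rather than the language itself, so two different strings can collide,
and the SingleColumnValueFilter is Peter's collision re-check rather than
pure overhead. The key construction itself can be sketched in plain Java -
ByteBuffer stands in for HBase's Bytes.toBytes(int), which also writes
big-endian, and the three-field layout mirrors the code in this thread:

```java
import java.nio.ByteBuffer;

// Start/stop keys for a prefix scan over the half-open hash range
// [hash, hash + 1): 4-byte language hash, 4-byte account id, 4-byte
// timestamp, all big-endian like Bytes.toBytes(int).
public class RowKeys {
    static byte[] key(int languageHash, int accountId, int timestamp) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash)
                .putInt(accountId)
                .putInt(timestamp)
                .array();
    }

    static byte[] startKey(String language) {
        return key(language.isEmpty() ? 0 : language.hashCode(), 0, 0);
    }

    static byte[] stopKey(String language) {
        int h = language.isEmpty() ? 0 : language.hashCode();
        // Edge case the thread does not handle: h == -1 encodes as
        // 0xFFFFFFFF, so h + 1 wraps to 0 and the stop key would sort
        // before the start key in HBase's unsigned byte order.
        return key(h + 1, 0, 0);
    }
}
```

The Scan(startRow, stopRow) constructor treats the stop row as exclusive,
which is why hash + 1 with zeroed trailing fields works as an upper bound.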