Posted to user@hbase.apache.org by Peter Wolf <op...@gmail.com> on 2012/01/25 13:56:02 UTC
Speeding up Scans
Hello all,
I am looking for advice on speeding up my Scanning.
I want to iterate over all rows where a particular column (language)
equals a particular value ("JA").
I am already creating my row keys using that column in the first bytes.
And I do my scans using partial row matching, like this...
public static byte[] calculateStartRowKey(String language) {
    int languageHash = language.length() > 0 ? language.hashCode() : 0;
    byte[] language2 = Bytes.toBytes(languageHash);
    byte[] accountID2 = Bytes.toBytes(0);
    byte[] timestamp2 = Bytes.toBytes(0);
    return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
}

public static byte[] calculateEndRowKey(String language) {
    int languageHash = language.length() > 0 ? language.hashCode() : 0;
    byte[] language2 = Bytes.toBytes(languageHash + 1);
    byte[] accountID2 = Bytes.toBytes(0);
    byte[] timestamp2 = Bytes.toBytes(0);
    return Bytes.add(Bytes.add(language2, accountID2), timestamp2);
}
Scan scan = new Scan(calculateStartRowKey(language),
calculateEndRowKey(language));
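For readers without HBase on the classpath, the key layout these helpers produce can be sketched with java.nio (assuming, as in HBase's Bytes utility, big-endian int encoding and plain concatenation):

```java
import java.nio.ByteBuffer;

// Sketch of the 12-byte key the helpers above build: three 4-byte
// big-endian ints laid out as [languageHash | accountID | timestamp].
public class RowKeyLayoutDemo {
    static byte[] startKey(String language) {
        int languageHash = language.length() > 0 ? language.hashCode() : 0;
        return ByteBuffer.allocate(12)
                .putInt(languageHash)  // bytes 0-3: hash of the language code
                .putInt(0)             // bytes 4-7: minimum accountID
                .putInt(0)             // bytes 8-11: minimum timestamp
                .array();
    }

    public static void main(String[] args) {
        byte[] key = startKey("JA");
        System.out.println(key.length);  // 12
    }
}
```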
Since I am using a hash of the string, I need to re-check the column to
make sure that some other string did not produce the same hash value:
Filter filter = new SingleColumnValueFilter(resultFamily, languageCol,
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes(language));
scan.setFilter(filter);
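The re-check is genuinely necessary, since Java's String.hashCode() collides easily; a self-contained illustration:

```java
// Distinct strings can share an int hash, so a key prefix built from
// hashCode() alone cannot distinguish them -- hence the value filter.
// "Aa" and "BB" are a classic String.hashCode() collision.
public class HashCollisionDemo {
    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112
        System.out.println("Aa".equals("BB")); // false
    }
}
```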
I am using the Cloudera 0.09.4 release, and a cluster of 3 machines on EC2.
I think that this should be really fast, but it is not. Any advice on
how to debug/speed it up?
Thanks
Peter
Re: Speeding up Scans
Posted by Michael Segel <mi...@hotmail.com>.
I'm confused...
You mention that you are hashing your key, and you want to do a scan with a start and stop value?
Could you elaborate?
With respect to hashing, if you use a SHA-1 hash, your values will be unique.
(you talked about rehashing ...)
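A minimal sketch of the SHA-1 approach in plain Java (illustrative only; note the tradeoff that hashed prefixes are unordered, so start/stop-row range scans only work when the whole prefix is a single exact hash value, as in the scheme above):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

// Sketch: derive a fixed-width, collision-resistant key prefix with SHA-1.
public class Sha1PrefixDemo {
    static byte[] sha1Prefix(String s) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        return md.digest(s.getBytes(StandardCharsets.UTF_8)); // 20-byte digest
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sha1Prefix("JA").length); // 20
    }
}
```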
Sent from my iPhone
Re: Speeding up Scans
Posted by Jean-Daniel Cryans <jd...@apache.org>.
If you're running a full scan (what PE scan does) on a table that
doesn't fit in the block cache, setting setCacheBlocks(true) is the
last thing you want to do (unless you fancy getting massive cache
churn).
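The churn effect can be modeled with a toy LRU in plain Java (illustrative only; the real block cache is more sophisticated):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model of the churn described above: an access-ordered LRU standing in
// for the block cache. Sequentially scanning more blocks than fit in the
// cache yields zero hits -- every block is evicted before it is revisited.
public class CacheChurnDemo {
    static int scanHits(final int cacheCapacity, int totalBlocks, int passes) {
        Map<Integer, Boolean> cache =
            new LinkedHashMap<Integer, Boolean>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<Integer, Boolean> e) {
                    return size() > cacheCapacity; // evict LRU past capacity
                }
            };
        int hits = 0;
        for (int p = 0; p < passes; p++) {
            for (int b = 0; b < totalBlocks; b++) {
                if (cache.containsKey(b)) hits++;
                cache.put(b, Boolean.TRUE);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Two full "table scans" of 1000 blocks through a 100-block cache:
        System.out.println(scanHits(100, 1000, 2));  // 0 -- pure churn
        // The same scans when the table fits in the cache:
        System.out.println(scanHits(1000, 1000, 2)); // 1000 -- 2nd pass all hits
    }
}
```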
33k does sound awfully low.
J-D
On Thu, Jan 26, 2012 at 6:54 AM, Tim Robertson
<ti...@gmail.com> wrote:
Re: Speeding up Scans
Posted by Tim Robertson <ti...@gmail.com>.
Hey Peter,
I am trying to benchmark our 3 node cluster now and trying to optimize
for scanning.
Using the PerformanceEvaluation tool I did a random write to populate
5M rows (I believe they are 1k each but whatever the tool does by
default).
I am seeing 33k records per second (which I believe to be too low)
with the following.
scan.setCacheBlocks(true);
scan.setCaching(10000);
It might be worth using the PE
(http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation) tool to
load, as then you are using a known table and content to compare
against.
I am running a 3-node cluster (2x quad core, 6x 250 GB SATA, 24 GB mem with 6 GB on the RS).
HTH,
Tim
On Thu, Jan 26, 2012 at 3:39 PM, Peter Wolf <op...@gmail.com> wrote:
Re: Speeding up Scans
Posted by Peter Wolf <op...@gmail.com>.
Thank you Doug and Geoff,
After following your advice I am now up to about 100 rows a second. Is
that considered fast for HBase?
My data is not big; I only have hundreds of thousands of rows in my table
at the moment.
Do I still have a tuning problem? How fast should I expect?
Thanks
Peter
On 1/25/12 2:32 PM, Doug Meil wrote:
Re: Speeding up Scans
Posted by Peter Wolf <op...@gmail.com>.
Interesting,
I added this, and my scan did speed up somewhat
conf.setInt("hbase.client.prefetch.limit", 100);
hTable = new HTable(conf, tableName);
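If this is the knob that matters, the same property can presumably also be set in the client's hbase-site.xml instead of in code (property name as used above; the value 100 is just the one tried here):

```xml
<!-- hbase-site.xml (client side): how many region locations to prefetch -->
<property>
  <name>hbase.client.prefetch.limit</name>
  <value>100</value>
</property>
```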
What does this configuration property really control, and how should it be
set to an appropriate value? What is a region, and how does it map to
rows, families, and columns? What are the tradeoffs for making it big?
Peter
Re: Speeding up Scans
Posted by Doug Meil <do...@explorysmedical.com>.
Thanks Geoff! No apology required, that's good stuff. I'll update the
book with that param.
On 1/25/12 2:17 PM, "Geoff Hendrey" <gh...@decarta.com> wrote:
RE: Speeding up Scans
Posted by Geoff Hendrey <gh...@decarta.com>.
Sorry for jumping in late, and perhaps out of context, but I'm pasting
in some findings (reported to this list by us a while back) that helped
us get scans to perform very fast. Adjusting
hbase.client.prefetch.limit was critical for us:
========================
It's even more mysterious than we think. There is a lack of documentation
(or perhaps a lack of know-how). Apparently there are two factors that
decide the performance of a scan.
1. Scanner caching, as we know. We always had scanner caching set to
1, but this is different from the prefetch limit.
2. hbase.client.prefetch.limit. This is the meta-caching limit; it
defaults to 10, so the client prefetches 10 region locations every time we
scan a region that has not already been pre-warmed.
The "hbase.client.prefetch.limit" value is passed along to the client code
to prefetch the next 10 region locations:
int rows = Math.min(rowLimit,
configuration.getInt("hbase.meta.scanner.caching", 100));
The "rows" variable takes the minimum, 10, and always prefetches at most 10
region boundaries. Hence every new region boundary that has not already
been pre-warmed fetches the next 10 region locations, resulting in a first
slow query followed by quick responses. This is basically pre-warming the
meta cache, not the region cache.
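A back-of-the-envelope model of that behavior (illustrative arithmetic only, not actual HBase client code):

```java
// Sketch: with a prefetch limit of 10, a scan crossing R region boundaries
// pays roughly ceil(R / 10) "slow" meta lookups; the rest hit the warmed
// location cache.
public class MetaPrefetchDemo {
    static int slowLookups(int regions, int prefetchLimit) {
        return (regions + prefetchLimit - 1) / prefetchLimit; // ceil division
    }

    public static void main(String[] args) {
        System.out.println(slowLookups(100, 10));  // 10 slow meta fetches
        System.out.println(slowLookups(100, 100)); // 1 slow meta fetch
    }
}
```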
-----Original Message-----
From: Jeff Whiting [mailto:jeffw@qualtrics.com]
Sent: Wednesday, January 25, 2012 10:09 AM
To: user@hbase.apache.org
Subject: Re: Speeding up Scans
Does it make sense to have better defaults so the performance out of the
box is better?
~Jeff
On 1/25/2012 8:06 AM, Peter Wolf wrote:
> Ah ha! I appear to be insane ;-)
>
> Adding the following speeded things up quite a bit
>
> scan.setCacheBlocks(true);
> scan.setCaching(1000);
>
> Thank you, it was a duh!
>
> P
>
>
>
> On 1/25/12 8:13 AM, Doug Meil wrote:
>> Hi there-
>>
>> Quick sanity check: what caching level are you using? (default is 1) I
>> know this is basic, but it's always good to double-check.
>>
>> If "language" is already in the lead position of the rowkey, why use
the
>> filter?
>>
>> As for EC2, that's a wildcard.
--
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com
Re: Speeding up Scans
Posted by Doug Meil <do...@explorysmedical.com>.
I think this is one of those "damned if you do..." situations. If you
want to do a lot of quick single-record lookups (a Get is actually a Scan
underneath the covers), then "1" is what you want. But for MapReduce
jobs, or for scans over a large number of records like you're doing,
you'll want the value higher.
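The tradeoff is easy to quantify: with scanner caching set to c, a scan
that returns n rows costs roughly ceil(n / c) RPC round trips. A
back-of-the-envelope sketch in plain Java (the numbers are illustrative,
not from this thread):

```java
// Rough cost model for scanner caching: each next() RPC returns up to
// `caching` rows, so a scan of `rows` rows needs ceil(rows / caching)
// round trips to the region servers.
public class RpcEstimate {
    static long rpcRoundTrips(long rows, int caching) {
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(rpcRoundTrips(100_000, 1));    // default: one RPC per row
        System.out.println(rpcRoundTrips(100_000, 1000)); // Peter's setting
    }
}
```

The flip side is memory: each round trip buffers up to `caching` rows on
both the region server and the client, which is why 1 is a safe default
for single-record lookups but painful for wide scans.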
On 1/25/12 1:09 PM, "Jeff Whiting" <je...@qualtrics.com> wrote:
>Does it make sense to have better defaults so the performance out of the
>box is better?
>
>~Jeff
Re: Speeding up Scans
Posted by Jeff Whiting <je...@qualtrics.com>.
Does it make sense to have better defaults so the performance out of the box is better?
~Jeff
On 1/25/2012 8:06 AM, Peter Wolf wrote:
> Ah ha! I appear to be insane ;-)
>
> Adding the following speeded things up quite a bit
>
> scan.setCacheBlocks(true);
> scan.setCaching(1000);
>
> Thank you, it was a duh!
>
> P
--
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com
Re: Speeding up Scans
Posted by Doug Meil <do...@explorysmedical.com>.
No problem! That's one of the tips in the Performance chapter of the
book/refGuide - always a good thing to double-check because even the most
experienced folks sometimes forget the simple stuff.
On 1/25/12 10:06 AM, "Peter Wolf" <op...@gmail.com> wrote:
>Ah ha! I appear to be insane ;-)
>
>Adding the following speeded things up quite a bit
>
> scan.setCacheBlocks(true);
> scan.setCaching(1000);
>
>Thank you, it was a duh!
>
>P
Re: Speeding up Scans
Posted by Peter Wolf <op...@gmail.com>.
Ah ha! I appear to be insane ;-)
Adding the following speeded things up quite a bit
scan.setCacheBlocks(true);
scan.setCaching(1000);
Thank you, it was a duh!
P
On 1/25/12 8:13 AM, Doug Meil wrote:
> Hi there-
>
> Quick sanity check: what caching level are you using? (default is 1) I
> know this is basic, but it's always good to double-check.
>
> If "language" is already in the lead position of the rowkey, why use the
> filter?
>
> As for EC2, that's a wildcard.
>
>
>
>
>
Re: Speeding up Scans
Posted by Doug Meil <do...@explorysmedical.com>.
Hi there-
Quick sanity check: what caching level are you using? (default is 1) I
know this is basic, but it's always good to double-check.
If "language" is already in the lead position of the rowkey, why use the
filter?
As for EC2, that's a wildcard.
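One caveat to dropping the filter: the key stores language.hashCode()
rather than the language itself, so two different strings can collide,
and the SingleColumnValueFilter is Peter's collision re-check rather than
pure overhead. The key construction itself can be sketched in plain Java -
ByteBuffer stands in for HBase's Bytes.toBytes(int), which also writes
big-endian, and the three-field layout mirrors the code in this thread:

```java
import java.nio.ByteBuffer;

// Start/stop keys for a prefix scan over the half-open hash range
// [hash, hash + 1): 4-byte language hash, 4-byte account id, 4-byte
// timestamp, all big-endian like Bytes.toBytes(int).
public class RowKeys {
    static byte[] key(int languageHash, int accountId, int timestamp) {
        return ByteBuffer.allocate(12)
                .putInt(languageHash)
                .putInt(accountId)
                .putInt(timestamp)
                .array();
    }

    static byte[] startKey(String language) {
        return key(language.isEmpty() ? 0 : language.hashCode(), 0, 0);
    }

    static byte[] stopKey(String language) {
        int h = language.isEmpty() ? 0 : language.hashCode();
        // Edge case the thread does not handle: h == -1 encodes as
        // 0xFFFFFFFF, so h + 1 wraps to 0 and the stop key would sort
        // before the start key in HBase's unsigned byte order.
        return key(h + 1, 0, 0);
    }
}
```

The Scan(startRow, stopRow) constructor treats the stop row as exclusive,
which is why hash + 1 with zeroed trailing fields works as an upper bound.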