Posted to user@hbase.apache.org by Vivek Krishna <vi...@gmail.com> on 2011/03/16 21:35:54 UTC

Row Counters

1.  How do I count rows fast in hbase?

First I tried count 'test', which takes ages.

Saw that I could use RowCounter, but looks like it is deprecated.  When I
try to use it, I get

java.io.IOException: Cannot create a record reader because of a previous
error. Please look at the previous logs lines from the task's full log for
more details.
at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)

If this is deprecated, is there any other way of finding the counts?

I just need to verify the total counts.  Is it possible to see somewhere in
the web interface or ganglia or by any other means?

Viv
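
For reference, every option discussed below ultimately scans the table; the shell's count is just a slow, chatty scan. A programmatic scan goes much faster when it fetches many rows per RPC and returns only one KeyValue per row. A minimal sketch against the 0.90-era client API (only the table name 'test' comes from this thread; the rest is illustrative):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

    public class ScanCount {
      public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "test");
        Scan scan = new Scan();
        scan.setCaching(1000);                    // fetch many rows per RPC, not one
        scan.setFilter(new FirstKeyOnlyFilter()); // ship one KeyValue per row, not the whole row
        ResultScanner scanner = table.getScanner(scan);
        long count = 0;
        for (Result r : scanner) {
          count++;
        }
        scanner.close();
        table.close();
        System.out.println("rows: " + count);
      }
    }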

Re: Row Counters

Posted by Jeff Whiting <je...@qualtrics.com>.
Thanks for the explanation.  Makes perfect sense now that you've explained it.  That would incur a 
huge write overhead, so I see why we don't keep the counts.

~Jeff

On 3/16/2011 2:59 PM, Matt Corgan wrote:
> Jeff,
>
> The problem is that when hbase receives a put or delete, it doesn't know if
> the put is overwriting an existing row or inserting a new one, and it
> doesn't know whether the requested row was there to delete.  This isn't
> known until read or compaction time.
>
> So to keep the counter up to date on every insert, it would have to check
> all of the region's storefiles which would slow down your inserts a lot.
>
> Matt
>
>
> On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu<yu...@gmail.com>  wrote:
>
>> Since we have lived so long without this information, I guess we can hold
>> for longer :-)
>> Another issue I am working on is to reduce memory footprint. See the
>> following discussion thread:
>> One of the regionserver aborted, then the master shut down itself
>>
>> We have to bear in mind that there would be around 10K regions or more in
>> production.
>>
>> Cheers
>>
>> On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting<je...@qualtrics.com>  wrote:
>>
>>> Just a random thought.  What about keeping a per region row count?  Then
>> if
>>> you needed to get a row count for a table you'd just have to query each
>>> region once and sum.  Seems like it wouldn't be too expensive because
>> you'd
>>> just have a row counter variable.  It may be more complicated than I'm
>> making
>>> it out to be though...
>>>
>>> ~Jeff
>>>
>>>
>>> On 3/16/2011 2:40 PM, Stack wrote:
>>>
>>>> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<vi...@gmail.com>
>>>>   wrote:
>>>>
>>>>> 1.  How do I count rows fast in hbase?
>>>>>
>>>>> First I tried count 'test', which takes ages.
>>>>>
>>>>> Saw that I could use RowCounter, but looks like it is deprecated.
>>>>>
>>>> It is not.  Make sure you are using the one from mapreduce package as
>>>> opposed to mapred package.
>>>>
>>>>
>>>>   I just need to verify the total counts.  Is it possible to see
>> somewhere
>>>>> in
>>>>> the web interface or ganglia or by any other means?
>>>>>
>>>>  We don't keep a current count on a table.  Too expensive.  Run the
>>>> rowcounter MR job.  This page may be of help:
>>>>
>>>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
>>>> Good luck,
>>>> St.Ack
>>>>
>>> --
>>> Jeff Whiting
>>> Qualtrics Senior Software Engineer
>>> jeffw@qualtrics.com
>>>
>>>

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com


Re: Row Counters

Posted by Matt Corgan <mc...@hotpads.com>.
Jeff,

The problem is that when hbase receives a put or delete, it doesn't know if
the put is overwriting an existing row or inserting a new one, and it
doesn't know whether the requested row was there to delete.  This isn't
known until read or compaction time.

So to keep the counter up to date on every insert, it would have to check
all of the region's storefiles which would slow down your inserts a lot.

Matt
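
An application that truly needs a live count can maintain one itself with HBase's atomic increment and pay the per-write cost described above, and only for writes it knows are new rows. A hypothetical sketch (the 'counters' table and its column names are invented for illustration; incrementColumnValue is the 0.90 client call):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class RowCountBump {
      // Add one to an application-maintained row count. Only correct when the
      // caller already knows the put created a brand-new row, which is exactly
      // the knowledge HBase itself lacks at write time.
      public static void bump() throws java.io.IOException {
        HTable counters = new HTable(HBaseConfiguration.create(), "counters");
        counters.incrementColumnValue(
            Bytes.toBytes("test"),   // row: name of the counted table
            Bytes.toBytes("c"),      // column family
            Bytes.toBytes("rows"),   // qualifier
            1L);                     // delta
        counters.close();
      }
    }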


On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu <yu...@gmail.com> wrote:

> Since we have lived so long without this information, I guess we can hold
> for longer :-)
> Another issue I am working on is to reduce memory footprint. See the
> following discussion thread:
> One of the regionserver aborted, then the master shut down itself
>
> We have to bear in mind that there would be around 10K regions or more in
> production.
>
> Cheers
>
> On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting <je...@qualtrics.com> wrote:
>
> > Just a random thought.  What about keeping a per region row count?  Then
> if
> > you needed to get a row count for a table you'd just have to query each
> > region once and sum.  Seems like it wouldn't be too expensive because
> you'd
> > just have a row counter variable.  It may be more complicated than I'm
> making
> > it out to be though...
> >
> > ~Jeff
> >
> >
> > On 3/16/2011 2:40 PM, Stack wrote:
> >
> >> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<vi...@gmail.com>
> >>  wrote:
> >>
> >>> 1.  How do I count rows fast in hbase?
> >>>
> >>> First I tried count 'test', which takes ages.
> >>>
> >>> Saw that I could use RowCounter, but looks like it is deprecated.
> >>>
> >> It is not.  Make sure you are using the one from mapreduce package as
> >> opposed to mapred package.
> >>
> >>
> >>  I just need to verify the total counts.  Is it possible to see
> somewhere
> >>> in
> >>> the web interface or ganglia or by any other means?
> >>>
> >>  We don't keep a current count on a table.  Too expensive.  Run the
> >> rowcounter MR job.  This page may be of help:
> >>
> >>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
> >>
> >> Good luck,
> >> St.Ack
> >>
> >
> > --
> > Jeff Whiting
> > Qualtrics Senior Software Engineer
> > jeffw@qualtrics.com
> >
> >
>

Re: OT - Hash Code Creation

Posted by Ted Dunning <td...@maprtech.com>.
On Thu, Mar 17, 2011 at 8:21 AM, Michael Segel <mi...@hotmail.com>wrote:

>
> Why not keep it simple?
>
> Use a SHA-1 hash of your key. See:
> http://codelog.blogial.com/2008/09/13/password-encryption-using-sha1-md5-java/
> (This was just the first one I found and there are others...)
>

Sha-1 is kind of slow.


>
> So as long as your key is unique, the sha-1 hash should also be unique.
>

Pretty much.  But you can get comparable performance with simpler hashes.
 These simpler hashes are very widely available even if not part of java.
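
One example of such a simple, widely implemented hash is Jenkins's one-at-a-time (the JOAAT linked elsewhere in this thread). A minimal Java sketch of the published algorithm:

    // Jenkins one-at-a-time hash (JOAAT) over a byte[] key; the unsigned
    // right shifts (>>>) stand in for C's logical shifts.
    public final class Joaat {
      public static int hash(byte[] key) {
        int h = 0;
        for (byte b : key) {
          h += (b & 0xff);
          h += (h << 10);
          h ^= (h >>> 6);
        }
        h += (h << 3);
        h ^= (h >>> 11);
        h += (h << 15);
        return h;
      }
    }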

RE: OT - Hash Code Creation

Posted by Michael Segel <mi...@hotmail.com>.
Why not keep it simple?

Use a SHA-1 hash of your key. See: http://codelog.blogial.com/2008/09/13/password-encryption-using-sha1-md5-java/
(This was just the first one I found and there are others...)

So as long as your key is unique, the sha-1 hash should also be unique.

The reason I suggest sha-1 is that it ships as part of Java SE (the security package, I think) so it's always there and it's unique enough.
(While a collision is theoretically possible, no one has found one yet. You could be the first. :-)  )

HTH

-Mike
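
For reference, the JDK route is only a few lines; a minimal sketch using java.security.MessageDigest, which ships with Java SE:

    import java.security.MessageDigest;

    public class Sha1Key {
      // Returns the 20-byte SHA-1 digest of a key; use it directly as a
      // rowkey, or truncate/encode it as needed.
      public static byte[] sha1(String key) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        return md.digest(key.getBytes("UTF-8"));
      }
    }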


> From: tdunning@maprtech.com
> Date: Thu, 17 Mar 2011 00:23:00 -0700
> Subject: Re: OT - Hash Code Creation
> To: user@hbase.apache.org
> CC: octo47@gmail.com
> 
> Double hashing is a fine thing.  To actually answer the question, though, I
> would recommend Murmurhash or JOAAT (
> http://en.wikipedia.org/wiki/Jenkins_hash_function)
> 
> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <oc...@gmail.com> wrote:
> 
> > Try hash table with double hashing.
> > Something like this
> >
> > http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm
> >
> > 2011/3/17 Peter Haidinyak <ph...@local.com>
> >
> > > Hi,
> > >        This is a little off topic but this group seems pretty swift so I
> > > thought I would ask. I am aggregating a day's worth of log data which
> > means
> > > I have a Map of over 24 million elements. What would be a good algorithm
> > to
> > > use for generating Hash Codes for these elements that cut down on
> > > collisions? My application starts out reading in a log (144 logs in all)
> > in
> > > about 20 seconds and by the time I reach the last log it is taking around
> > > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions.
> > > I've played around with different Hashing algorithms and cut the original
> > > time from over 300 seconds to 120 but I know I can do better.
> > > The key I am using for the Map is an alpha-numeric string that is
> > > approximately 16 characters long with the last 4 or 5 characters being the
> > > most unique.
> > >
> > > Any ideas?
> > >
> > > Thanks
> > >
> > > -Pete
> > >
> >
 		 	   		  

Re: OT - Hash Code Creation

Posted by Chris Tarnas <cf...@email.com>.
Ok - now I understand - doing pre-splits using the full binary space does not make sense when using a limited range. I do all my splits in the base-64 character space or let hbase do them organically.
thanks for the explanation.
-chris
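
A hypothetical sketch of what pre-splitting in the base-64 character space can look like: take split points from the byte-sorted base-64 alphabet rather than the full 0x00-0xFF range, so every pre-split region actually receives keys. The region count here is illustrative:

    import java.util.Arrays;

    public class Base64Splits {
      public static void main(String[] args) {
        char[] alphabet =
            ("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
                .toCharArray();
        Arrays.sort(alphabet);  // split points must be in byte order
        int regions = 16;       // illustrative region count
        for (int i = 1; i < regions; i++) {
          System.out.println("split point: " + alphabet[i * alphabet.length / regions]);
        }
      }
    }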


On Mar 17, 2011, at 11:32 AM, Ted Dunning wrote:

> Just that base-64 is not uniformly distributed relative to a binary representation.  This is simply  because it is all printable characters.  If you do a 256 way pre-split based on a binary interpretation of the key, 64 regions will get traffic and 192 will get none.  Among other things, this can seriously mess up benchmarking.  The situation is even worse with decimal integer representations.
> 
> On Thu, Mar 17, 2011 at 11:19 AM, Chris Tarnas <cf...@email.com> wrote:
> I'm not sure I am clear, are you saying 64 bit chunks of a MD5 keys are not uniformly distributed? Or that a base-64 encoding is not evenly distributed?
> 
> thanks,
> -chris
> 
> On Mar 17, 2011, at 10:23 AM, Ted Dunning wrote:
> 
>> 
>> There can be some odd effects with this because the keys are not uniformly distributed.  Beware if you are using pre-split tables because the region traffic can be pretty unbalanced if you do a naive split.
>> 
>> On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas <cf...@email.com> wrote:
>> I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters.
>> 
> 
> 


Re: OT - Hash Code Creation

Posted by Ted Dunning <td...@maprtech.com>.
Just that base-64 is not uniformly distributed relative to a binary
representation.  This is simply  because it is all printable characters.  If
you do a 256 way pre-split based on a binary interpretation of the key, 64
regions will get traffic and 192 will get none.  Among other things, this
can seriously mess up benchmarking.  The situation is even worse with
decimal integer representations.

On Thu, Mar 17, 2011 at 11:19 AM, Chris Tarnas <cf...@email.com> wrote:

> I'm not sure I am clear, are you saying 64 bit chunks of a MD5 keys are not
> uniformly distributed? Or that a base-64 encoding is not evenly distributed?
>
> thanks,
> -chris
>
> On Mar 17, 2011, at 10:23 AM, Ted Dunning wrote:
>
>
> There can be some odd effects with this because the keys are not uniformly
> distributed.  Beware if you are using pre-split tables because the region
> traffic can be pretty unbalanced if you do a naive split.
>
> On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas <cf...@email.com> wrote:
>
>> I've been using base-64 encoding when I use my hashes as rowkeys - makes
>> them printable while still being fairly dense, IIRC a 64bit key should be
>> only 11 characters.
>
>
>
>

Re: OT - Hash Code Creation

Posted by Chris Tarnas <cf...@email.com>.
I'm not sure I am clear, are you saying 64 bit chunks of a MD5 keys are not uniformly distributed? Or that a base-64 encoding is not evenly distributed?

thanks,
-chris

On Mar 17, 2011, at 10:23 AM, Ted Dunning wrote:

> 
> There can be some odd effects with this because the keys are not uniformly distributed.  Beware if you are using pre-split tables because the region traffic can be pretty unbalanced if you do a naive split.
> 
> On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas <cf...@email.com> wrote:
> I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense, IIRC a 64bit key should be only 11 characters.
> 


Re: OT - Hash Code Creation

Posted by Ted Dunning <td...@maprtech.com>.
There can be some odd effects with this because the keys are not uniformly
distributed.  Beware if you are using pre-split tables because the region
traffic can be pretty unbalanced if you do a naive split.

On Thu, Mar 17, 2011 at 9:20 AM, Chris Tarnas <cf...@email.com> wrote:

> I've been using base-64 encoding when I use my hashes as rowkeys - makes
> them printable while still being fairly dense, IIRC a 64bit key should be
> only 11 characters.

RE: OT - Hash Code Creation

Posted by Peter Haidinyak <ph...@local.com>.
Final tally on the import of a full day's worth of search logs. The process started out at 12 seconds per log and ended at 15 seconds per log. Previously, the process started out at 24 seconds per log and ended at 154 seconds per log. I think I'll stay with my current Hash Code Generation Algorithm. Thanks, Ted.

-Pete

-----Original Message-----
From: Peter Haidinyak [mailto:phaidinyak@local.com] 
Sent: Thursday, March 17, 2011 9:44 AM
To: user@hbase.apache.org
Subject: RE: OT - Hash Code Creation

Hash Code in Object is limited to an int, and a quick look at HashMap and Trove's HashMap suggests they only use 31 bits of that. I am now trying a modified version of what Ted pointed at and it seems to be working very well. I modified the original since only the last few bytes in the key are usually unique, so I start at both ends when creating the Hash Code. So far I am halfway through an import and the times went down from 24 seconds to 11 seconds on the first few files and have been holding around 13 seconds at the halfway point, vs 45 seconds with the old method.

Thanks

-Pete

-----Original Message-----
From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of Chris Tarnas
Sent: Thursday, March 17, 2011 9:21 AM
To: user@hbase.apache.org
Subject: Re: OT - Hash Code Creation

With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision; the rule of thumb is that with a 64bit hash key you can expect a collision when you reach about 2^32 elements in your set. I use half of a 128bit MD5 sum (a cryptographic hash, so you can use only parts of it if you want) as that is readily available in our systems and so far has not been a bottleneck. I believe there is now a 64bit murmurhash; that would be faster to compute and ideal for what you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense; IIRC a 64bit key should be only 11 characters.

-chris

On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote:

> Thanks, I'll give that a try.
> 
> -Pete
> 
> On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <td...@maprtech.com> wrote:
> 
>> Double hashing is a fine thing.  To actually answer the question, though, I
>> would recommend Murmurhash or JOAAT (
>> http://en.wikipedia.org/wiki/Jenkins_hash_function)
>> 
>> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <oc...@gmail.com> wrote:
>> 
>>> Try hash table with double hashing.
>>> Something like this
>>> 
>>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm
>>> 
>>> 2011/3/17 Peter Haidinyak <ph...@local.com>
>>> 
>>> > Hi,
>>> >        This is a little off topic but this group seems pretty swift so I
>>> > thought I would ask. I am aggregating a day's worth of log data which
>>> means
>>> > I have a Map of over 24 million elements. What would be a good algorithm
>>> to
>>> > use for generating Hash Codes for these elements that cut down on
>>> > collisions? My application starts out reading in a log (144 logs in all)
>>> in
>>> > about 20 seconds and by the time I reach the last log it is taking around
>>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions.
>>> > I've played around with different Hashing algorithms and cut the original
>>> > time from over 300 seconds to 120 but I know I can do better.
>>> > The key I am using for the Map is an alpha-numeric string that is
>>> > approximately 16 characters long with the last 4 or 5 characters being the
>>> > most unique.
>>> >
>>> > Any ideas?
>>> >
>>> > Thanks
>>> >
>>> > -Pete
>>> >
> 


RE: OT - Hash Code Creation

Posted by Peter Haidinyak <ph...@local.com>.
Hash Code in Object is limited to an int, and a quick look at HashMap and Trove's HashMap suggests they only use 31 bits of that. I am now trying a modified version of what Ted pointed at and it seems to be working very well. I modified the original since only the last few bytes in the key are usually unique, so I start at both ends when creating the Hash Code. So far I am halfway through an import and the times went down from 24 seconds to 11 seconds on the first few files and have been holding around 13 seconds at the halfway point, vs 45 seconds with the old method.

Thanks

-Pete
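
A hypothetical reconstruction of that idea (the actual code is not shown in this thread): fold the key from both ends so the highly variable trailing characters get mixed in early.

    public class BothEndsHash {
      // Hypothetical sketch only; Peter's actual implementation is not shown
      // in the thread.
      public static int hash(String key) {
        int h = 0;
        int i = 0, j = key.length() - 1;
        while (i <= j) {
          h = 31 * h + key.charAt(j--);    // take from the highly unique tail first
          if (i <= j) {
            h = 31 * h + key.charAt(i++);  // then from the head
          }
        }
        return h;
      }
    }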

-----Original Message-----
From: Christopher Tarnas [mailto:cft@tarnas.org] On Behalf Of Chris Tarnas
Sent: Thursday, March 17, 2011 9:21 AM
To: user@hbase.apache.org
Subject: Re: OT - Hash Code Creation

With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision; the rule of thumb is that with a 64bit hash key you can expect a collision when you reach about 2^32 elements in your set. I use half of a 128bit MD5 sum (a cryptographic hash, so you can use only parts of it if you want) as that is readily available in our systems and so far has not been a bottleneck. I believe there is now a 64bit murmurhash; that would be faster to compute and ideal for what you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense; IIRC a 64bit key should be only 11 characters.

-chris

On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote:

> Thanks, I'll give that a try.
> 
> -Pete
> 
> On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <td...@maprtech.com> wrote:
> 
>> Double hashing is a fine thing.  To actually answer the question, though, I
>> would recommend Murmurhash or JOAAT (
>> http://en.wikipedia.org/wiki/Jenkins_hash_function)
>> 
>> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <oc...@gmail.com> wrote:
>> 
>>> Try hash table with double hashing.
>>> Something like this
>>> 
>>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm
>>> 
>>> 2011/3/17 Peter Haidinyak <ph...@local.com>
>>> 
>>> > Hi,
>>> >        This is a little off topic but this group seems pretty swift so I
>>> > thought I would ask. I am aggregating a day's worth of log data which
>>> means
>>> > I have a Map of over 24 million elements. What would be a good algorithm
>>> to
>>> > use for generating Hash Codes for these elements that cut down on
>>> > collisions? My application starts out reading in a log (144 logs in all)
>>> in
>>> > about 20 seconds and by the time I reach the last log it is taking around
>>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions.
>>> > I've played around with different Hashing algorithms and cut the original
>>> > time from over 300 seconds to 120 but I know I can do better.
>>> > The key I am using for the Map is an alpha-numeric string that is
>>> > approximately 16 characters long with the last 4 or 5 characters being the
>>> > most unique.
>>> >
>>> > Any ideas?
>>> >
>>> > Thanks
>>> >
>>> > -Pete
>>> >
> 


Re: OT - Hash Code Creation

Posted by Chris Tarnas <cf...@email.com>.
With 24 million elements you'd probably want a 64bit hash to minimize the risk of collision; the rule of thumb is that with a 64bit hash key you can expect a collision when you reach about 2^32 elements in your set. I use half of a 128bit MD5 sum (a cryptographic hash, so you can use only parts of it if you want) as that is readily available in our systems and so far has not been a bottleneck. I believe there is now a 64bit murmurhash; that would be faster to compute and ideal for what you want. I've been using base-64 encoding when I use my hashes as rowkeys - makes them printable while still being fairly dense; IIRC a 64bit key should be only 11 characters.

-chris
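
A minimal sketch of that scheme, assuming Apache commons-codec is on the classpath for the base-64 step: half of a 128bit MD5 sum is 8 bytes, which base-64 encodes to 11 characters plus one character of padding.

    import java.security.MessageDigest;
    import java.util.Arrays;
    import org.apache.commons.codec.binary.Base64;  // assumption: commons-codec available

    public class HashRowkey {
      // 64bit rowkey: first half of an MD5 digest, base-64 encoded.
      public static String rowkey(String key) throws Exception {
        byte[] md5 = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        byte[] half = Arrays.copyOf(md5, 8);            // first 64 bits only
        byte[] b64 = Base64.encodeBase64(half);         // 12 chars: 11 data + '=' padding
        return new String(b64, "US-ASCII").substring(0, 11);
      }
    }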

On Mar 17, 2011, at 12:30 AM, Pete Haidinyak wrote:

> Thanks, I'll give that a try.
> 
> -Pete
> 
> On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <td...@maprtech.com> wrote:
> 
>> Double hashing is a fine thing.  To actually answer the question, though, I
>> would recommend Murmurhash or JOAAT (
>> http://en.wikipedia.org/wiki/Jenkins_hash_function)
>> 
>> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <oc...@gmail.com> wrote:
>> 
>>> Try hash table with double hashing.
>>> Something like this
>>> 
>>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm
>>> 
>>> 2011/3/17 Peter Haidinyak <ph...@local.com>
>>> 
>>> > Hi,
>>> >        This is a little off topic but this group seems pretty swift so I
>>> > thought I would ask. I am aggregating a day's worth of log data which
>>> means
>>> > I have a Map of over 24 million elements. What would be a good algorithm
>>> to
>>> > use for generating Hash Codes for these elements that cut down on
>>> > collisions? My application starts out reading in a log (144 logs in all)
>>> in
>>> > about 20 seconds and by the time I reach the last log it is taking around
>>> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions.
>>> > I've played around with different Hashing algorithms and cut the original
>>> > time from over 300 seconds to 120 but I know I can do better.
>>> > The key I am using for the Map is an alpha-numeric string that is
>>> > approximately 16 characters long with the last 4 or 5 characters being the
>>> > most unique.
>>> >
>>> > Any ideas?
>>> >
>>> > Thanks
>>> >
>>> > -Pete
>>> >
> 


Re: OT - Hash Code Creation

Posted by Pete Haidinyak <ja...@cox.net>.
Thanks, I'll give that a try.

-Pete

On Thu, 17 Mar 2011 00:23:00 -0700, Ted Dunning <td...@maprtech.com>  
wrote:

> Double hashing is a fine thing.  To actually answer the question,
> though, I
> would recommend Murmurhash or JOAAT (
> http://en.wikipedia.org/wiki/Jenkins_hash_function)
>
> On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <oc...@gmail.com>  
> wrote:
>
>> Try hash table with double hashing.
>> Something like this
>>
>> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm
>>
>> 2011/3/17 Peter Haidinyak <ph...@local.com>
>>
>> > Hi,
>> >        This is a little off topic but this group seems pretty swift  
>> so I
>> > thought I would ask. I am aggregating a day's worth of log data which
>> means
>> > I have a Map of over 24 million elements. What would be a good  
>> algorithm
>> to
>> > use for generating Hash Codes for these elements that cut down on
>> > collisions? My application starts out reading in a log (144 logs in
>> all)
>> in
>> > about 20 seconds and by the time I reach the last log it is taking  
>> around
>> > 120 seconds. The extra 100 seconds have to do with Hash Table  
>> Collisions.
>> > I've played around with different Hashing algorithms and cut the  
>> original
>> > time from over 300 seconds to 120 but I know I can do better.
>> > The key I am using for the Map is an alpha-numeric string that is
>> > approximately 16 characters long with the last 4 or 5 characters being
>> the
>> > most unique.
>> >
>> > Any ideas?
>> >
>> > Thanks
>> >
>> > -Pete
>> >


Re: OT - Hash Code Creation

Posted by Ted Dunning <td...@maprtech.com>.
Double hashing is a fine thing.  To actually answer the question, though, I
would recommend Murmurhash or JOAAT (
http://en.wikipedia.org/wiki/Jenkins_hash_function)

On Wed, Mar 16, 2011 at 3:48 PM, Andrey Stepachev <oc...@gmail.com> wrote:

> Try hash table with double hashing.
> Something like this
>
> http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm
>
> 2011/3/17 Peter Haidinyak <ph...@local.com>
>
> > Hi,
> >        This is a little off topic but this group seems pretty swift so I
> > thought I would ask. I am aggregating a day's worth of log data which
> means
> > I have a Map of over 24 million elements. What would be a good algorithm
> to
> > use for generating Hash Codes for these elements that cut down on
> > collisions? My application starts out reading in a log (144 logs in all)
> in
> > about 20 seconds and by the time I reach the last log it is taking around
> > 120 seconds. The extra 100 seconds have to do with Hash Table Collisions.
> > I've played around with different Hashing algorithms and cut the original
> > time from over 300 seconds to 120 but I know I can do better.
> > The key I am using for the Map is an alpha-numeric string that is
> > approximately 16 characters long with the last 4 or 5 characters being the
> > most unique.
> >
> > Any ideas?
> >
> > Thanks
> >
> > -Pete
> >
>

Re: OT - Hash Code Creation

Posted by Andrey Stepachev <oc...@gmail.com>.
Try hash table with double hashing.
Something like this
http://www.java2s.com/Code/Java/Collections-Data-Structure/Hashtablewithdoublehashing.htm
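
For reference, the heart of double hashing is the probe sequence: a second hash, derived so it is never zero, sets the step size. A minimal sketch, assuming the table capacity is a prime greater than 2:

    public class DoubleHash {
      // Slot for the given attempt number (0, 1, 2, ...) of an open-addressed
      // probe. capacity is assumed prime so every step is coprime to it.
      public static int probe(Object key, int attempt, int capacity) {
        int h = key.hashCode() & 0x7fffffff;   // strip the sign bit
        int h1 = h % capacity;                 // primary slot
        int h2 = 1 + (h % (capacity - 2));     // step size, never zero
        return (int) ((h1 + (long) attempt * h2) % capacity);
      }
    }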

2011/3/17 Peter Haidinyak <ph...@local.com>

> Hi,
>        This is a little off topic but this group seems pretty swift so I
> thought I would ask. I am aggregating a day's worth of log data which means
> I have a Map of over 24 million elements. What would be a good algorithm to
> use for generating Hash Codes for these elements that cut down on
> collisions? My application starts out reading in a log (144 logs in all) in
> about 20 seconds and by the time I reach the last log it is taking around
> 120 seconds. The extra 100 seconds have to do with Hash Table Collisions.
> I've played around with different Hashing algorithms and cut the original
> time from over 300 seconds to 120 but I know I can do better.
> The key I am using for the Map is an alpha-numeric string that is
> approximately 16 characters long with the last 4 or 5 characters being the
> most unique.
>
> Any ideas?
>
> Thanks
>
> -Pete
>

OT - Hash Code Creation

Posted by Peter Haidinyak <ph...@local.com>.
Hi,
	This is a little off topic but this group seems pretty swift so I thought I would ask. I am aggregating a day's worth of log data which means I have a Map of over 24 million elements. What would be a good algorithm to use for generating Hash Codes for these elements that cut down on collisions? My application starts out reading in a log (144 logs in all) in about 20 seconds and by the time I reach the last log it is taking around 120 seconds. The extra 100 seconds have to do with Hash Table Collisions. I've played around with different Hashing algorithms and cut the original time from over 300 seconds to 120 but I know I can do better.
The key I am using for the Map is an alpha-numeric string that is approximately 16 characters long with the last 4 or 5 characters being the most unique.
  
Any ideas? 

Thanks

-Pete

Re: Row Counters

Posted by Bill Graham <bi...@gmail.com>.
Back to the issue of keeping a count, I've often wondered whether this
would be easy to do without much cost at compaction time. It of course
wouldn't be a true real-time total but something like a
compactedRowCount. It could be a useful metric to expose via JMX to
get a feel for growth over time.


On Wed, Mar 16, 2011 at 3:40 PM, Vivek Krishna <vi...@gmail.com> wrote:
> Works. Thanks.
> Viv
>
>
>
> On Wed, Mar 16, 2011 at 6:21 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> The connection loss was due to inability to find the zookeeper quorum.
>>
>> Use the commandline in my previous email.
>>
>>
>> On Wed, Mar 16, 2011 at 3:18 PM, Vivek Krishna <vi...@gmail.com>wrote:
>>
>>> Oops. sorry about the environment.
>>>
>>> I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4
>>> and zookeeper-3.3.2-CDH3B4.
>>>
>>> I was able to configure jars and run the command,
>>>
>>> hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test,
>>>
>>> but I get
>>>
>>> java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
>>>      at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
>>>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>>>      at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
>>>      at java.security.AccessController.doPrivileged(Native Method)
>>>      at javax.security.auth.Subject.doAs(Subject.java:396)
>>>      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>      at org.apache.hadoop.mapred.Child.main(Child.java:234)
>>>
>>>
>>> The previous error in the task's full log is ..
>>>
>>>
>>> 2011-03-16 21:41:03,367 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>>>      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988)
>>>      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301)
>>>      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:292)
>>>      at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155)
>>>      at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:167)
>>>      at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:145)
>>>      at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
>>>      at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
>>>      at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>>>      at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
>>>      at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>>>      at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
>>>      at java.security.AccessController.doPrivileged(Native Method)
>>>      at javax.security.auth.Subject.doAs(Subject.java:396)
>>>      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>      at org.apache.hadoop.mapred.Child.main(Child.java:234)
>>> Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>>>      at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:147)
>>>      at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986)
>>>      ... 15 more
>>> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>>>      at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>>>      at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>>>      at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
>>>      at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
>>>      at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
>>>      ... 16 more
>>>
>>>
>>> I am pretty sure the zookeeper master is running on the same machine at
>>> port 2181.  Not sure why the connection loss occurs.  Do I need
>>> HBASE-3578 <https://issues.apache.org/jira/browse/HBASE-3578> by any
>>> chance?
>>>
>>> Viv
>>>
>>>
>>>
>>>
>>> On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu <yu...@gmail.com> wrote:
>>>
>>>> In the future, describe your environment a bit.
>>>>
>>>> The way I approach this is:
>>>> find the correct commandline from
>>>> src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java
>>>>
>>>> Then I issue:
>>>> [hadoop@us01-ciqps1-name01 hbase]$
>>>> HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
>>>> classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar
>>>> rowcounter packageindex
>>>>
>>>> Then I check the map/reduce task on job tracker URL
>>>>
>>>> On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna <vivekrishna@gmail.com
>>>> >wrote:
>>>>
>>>> > I guess it is using the mapred class
>>>> >
>>>> > 11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
>>>> > attempt_201103161245_0005_m_000004_0, Status : FAILED
>>>> > java.io.IOException: Cannot create a record reader because of a
>>>> previous
>>>> > error. Please look at the previous logs lines from the task's full log
>>>> for
>>>> > more details.
>>>> >  at
>>>> >
>>>> >
>>>> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
>>>> > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>>>> >  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>>>> > at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
>>>> >  at java.security.AccessController.doPrivileged(Native Method)
>>>> > at javax.security.auth.Subject.doAs(Subject.java:396)
>>>> >  at
>>>> >
>>>> >
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>>> > at org.apache.hadoop.mapred.Child.main(Child.java:234)
>>>> >
>>>> > How do I use mapreduce class?
>>>> > Viv
>>>> >
>>>> >
>>>> >
>>>> > On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu <yu...@gmail.com> wrote:
>>>> >
>>>> > > Since we have lived so long without this information, I guess we can
>>>> hold
>>>> > > for longer :-)
>>>> > > Another issue I am working on is to reduce memory footprint. See the
>>>> > > following discussion thread:
>>>> > > One of the regionserver aborted, then the master shut down itself
>>>> > >
>>>> > > We have to bear in mind that there would be around 10K regions or
>>>> more in
>>>> > > production.
>>>> > >
>>>> > > Cheers
>>>> > >
>>>> > > On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting <je...@qualtrics.com>
>>>> > wrote:
>>>> > >
>>>> > > > Just a random thought.  What about keeping a per region row count?
>>>> >  Then
>>>> > > if
>>>> > > > you needed to get a row count for a table you'd just have to query
>>>> each
>>>> > > > region once and sum.  Seems like it wouldn't be too expensive
>>>> because
>>>> > > you'd
>>>> > > > just have a row counter variable.  It may be more complicated than
>>>> I'm
>>>> > > making
>>>> > > > it out to be though...
>>>> > > >
>>>> > > > ~Jeff
>>>> > > >
>>>> > > >
>>>> > > > On 3/16/2011 2:40 PM, Stack wrote:
>>>> > > >
>>>> > > >> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<
>>>> vivekrishna@gmail.com>
>>>> > > >>  wrote:
>>>> > > >>
>>>> > > >>> 1.  How do I count rows fast in hbase?
>>>> > > >>>
>>>> > > >>> First I tried count 'test', which takes ages.
>>>> > > >>>
>>>> > > >>> Saw that I could use RowCounter, but looks like it is deprecated.
>>>> > > >>>
>>>> > > >> It is not.  Make sure you are using the one from mapreduce package
>>>> as
>>>> > > >> opposed to mapred package.
>>>> > > >>
>>>> > > >>
>>>> > > >>  I just need to verify the total counts.  Is it possible to see
>>>> > > somewhere
>>>> > > >>> in
>>>> > > >>> the web interface or ganglia or by any other means?
>>>> > > >>>
>>>> > > >>  We don't keep a current count on a table.  Too expensive.  Run
>>>> the
>>>> > > >> rowcounter MR job.  This page may be of help:
>>>> > > >>
>>>> > > >>
>>>> > >
>>>> >
>>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
>>>> > > >>
>>>> > > >> Good luck,
>>>> > > >> St.Ack
>>>> > > >>
>>>> > > >
>>>> > > > --
>>>> > > > Jeff Whiting
>>>> > > > Qualtrics Senior Software Engineer
>>>> > > > jeffw@qualtrics.com
>>>> > > >
>>>> > > >
>>>> > >
>>>> >
>>>>
>>>
>>>
>>
>

Re: Row Counters

Posted by Vivek Krishna <vi...@gmail.com>.
Works. Thanks.
Viv



On Wed, Mar 16, 2011 at 6:21 PM, Ted Yu <yu...@gmail.com> wrote:

> The connection loss was due to inability to find the zookeeper quorum.
>
> Use the commandline in my previous email.
>
>
> On Wed, Mar 16, 2011 at 3:18 PM, Vivek Krishna <vi...@gmail.com>wrote:
>
>> Oops. sorry about the environment.
>>
>> I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4
>> and zookeeper-3.3.2-CDH3B4.
>>
>> I was able to configure jars and run the command,
>>
>> hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test,
>>
>> but I get
>>
>> java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
>> 	at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
>> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:396)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>> 	at org.apache.hadoop.mapred.Child.main(Child.java:234)
>>
>>
>> The previous error in the task's full log is ..
>>
>>
>> 2011-03-16 21:41:03,367 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988)
>> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301)
>> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:292)
>> 	at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155)
>> 	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:167)
>> 	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:145)
>> 	at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
>> 	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
>> 	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
>> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
>> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
>> 	at java.security.AccessController.doPrivileged(Native Method)
>> 	at javax.security.auth.Subject.doAs(Subject.java:396)
>> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>> 	at org.apache.hadoop.mapred.Child.main(Child.java:234)
>> Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>> 	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:147)
>> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986)
>> 	... 15 more
>> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
>> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
>> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
>> 	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
>> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
>> 	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
>> 	... 16 more
>>
>>
>> I am pretty sure the zookeeper master is running on the same machine at
>> port 2181.  Not sure why the connection loss occurs.  Do I need
>> HBASE-3578 <https://issues.apache.org/jira/browse/HBASE-3578> by any
>> chance?
>>
>> Viv
>>
>>
>>
>>
>> On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu <yu...@gmail.com> wrote:
>>
>>> In the future, describe your environment a bit.
>>>
>>> The way I approach this is:
>>> find the correct commandline from
>>> src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java
>>>
>>> Then I issue:
>>> [hadoop@us01-ciqps1-name01 hbase]$
>>> HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
>>> classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar
>>> rowcounter packageindex
>>>
>>> Then I check the map/reduce task on job tracker URL
>>>
>>> On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna <vivekrishna@gmail.com
>>> >wrote:
>>>
>>> > I guess it is using the mapred class
>>> >
>>> > 11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
>>> > attempt_201103161245_0005_m_000004_0, Status : FAILED
>>> > java.io.IOException: Cannot create a record reader because of a
>>> previous
>>> > error. Please look at the previous logs lines from the task's full log
>>> for
>>> > more details.
>>> >  at
>>> >
>>> >
>>> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
>>> > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>>> >  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>>> > at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
>>> >  at java.security.AccessController.doPrivileged(Native Method)
>>> > at javax.security.auth.Subject.doAs(Subject.java:396)
>>> >  at
>>> >
>>> >
>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>>> > at org.apache.hadoop.mapred.Child.main(Child.java:234)
>>> >
>>> > How do I use mapreduce class?
>>> > Viv
>>> >
>>> >
>>> >
>>> > On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu <yu...@gmail.com> wrote:
>>> >
>>> > > Since we have lived so long without this information, I guess we can
>>> hold
>>> > > for longer :-)
>>> > > Another issue I am working on is to reduce memory footprint. See the
>>> > > following discussion thread:
>>> > > One of the regionserver aborted, then the master shut down itself
>>> > >
>>> > > We have to bear in mind that there would be around 10K regions or
>>> more in
>>> > > production.
>>> > >
>>> > > Cheers
>>> > >
>>> > > On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting <je...@qualtrics.com>
>>> > wrote:
>>> > >
>>> > > > Just a random thought.  What about keeping a per region row count?
>>> >  Then
>>> > > if
>>> > > > you needed to get a row count for a table you'd just have to query
>>> each
>>> > > > region once and sum.  Seems like it wouldn't be too expensive
>>> because
>>> > > you'd
>>> > > > just have a row counter variable.  It may be more complicated than
>>> I'm
>>> > > making
>>> > > > it out to be though...
>>> > > >
>>> > > > ~Jeff
>>> > > >
>>> > > >
>>> > > > On 3/16/2011 2:40 PM, Stack wrote:
>>> > > >
>>> > > >> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<
>>> vivekrishna@gmail.com>
>>> > > >>  wrote:
>>> > > >>
>>> > > >>> 1.  How do I count rows fast in hbase?
>>> > > >>>
>>> > > >>> First I tired count 'test'  , takes ages.
>>> > > >>>
>>> > > >>> Saw that I could use RowCounter, but looks like it is deprecated.
>>> > > >>>
>>> > > >> It is not.  Make sure you are using the one from mapreduce package
>>> as
>>> > > >> opposed to mapred package.
>>> > > >>
>>> > > >>
>>> > > >>  I just need to verify the total counts.  Is it possible to see
>>> > > somewhere
>>> > > >>> in
>>> > > >>> the web interface or ganglia or by any other means?
>>> > > >>>
>>> > > >>  We don't keep a current count on a table.  Too expensive.  Run
>>> the
>>> > > >> rowcounter MR job.  This page may be of help:
>>> > > >>
>>> > > >>
>>> > >
>>> >
>>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
>>> > > >>
>>> > > >> Good luck,
>>> > > >> St.Ack
>>> > > >>
>>> > > >
>>> > > > --
>>> > > > Jeff Whiting
>>> > > > Qualtrics Senior Software Engineer
>>> > > > jeffw@qualtrics.com
>>> > > >
>>> > > >
>>> > >
>>> >
>>>
>>
>>
>

Re: Row Counters

Posted by Ted Yu <yu...@gmail.com>.
The connection loss was due to inability to find the zookeeper quorum.

Use the commandline in my previous email.
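
For anyone hitting the same ConnectionLoss: the quorum can also be made explicit in the job's configuration. A minimal sketch; the host names are placeholders, and the property names are the standard ones from hbase-default.xml:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    public class QuorumConf {
      // Build a Configuration that points map tasks at the right zookeeper
      // quorum, so TableInputFormat can find /hbase.
      public static Configuration make() {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1.example.com,zk2.example.com");  // placeholders
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        return conf;
      }
    }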

On Wed, Mar 16, 2011 at 3:18 PM, Vivek Krishna <vi...@gmail.com>wrote:

> Oops. sorry about the environment.
>
> I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4
> and zookeeper-3.3.2-CDH3B4.
>
> I was able to configure jars and run the command,
>
> hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test,
>
> but I get
>
> java.io.IOException: Cannot create a record reader because of a previous error. Please look at the previous logs lines from the task's full log for more details.
> 	at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:234)
>
>
> The previous error in the task's full log is ..
>
>
> 2011-03-16 21:41:03,367 ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormat: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:292)
> 	at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155)
> 	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:167)
> 	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:145)
> 	at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
> 	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
> 	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
> 	at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:396)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:234)
> Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:147)
> 	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986)
> 	... 15 more
> Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
> 	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
> 	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
> 	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
> 	... 16 more
>
>
> I am pretty sure the zookeeper master is running on the same machine at
> port 2181.  Not sure why the connection loss occurs.  Do I need HBASE-3578 <https://issues.apache.org/jira/browse/HBASE-3578> by any chance?
>
> Viv
>
>
>
>
> On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu <yu...@gmail.com> wrote:
>
>> In the future, describe your environment a bit.
>>
>> The way I approach this is:
>> find the correct commandline from
>> src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java
>>
>> Then I issue:
>> [hadoop@us01-ciqps1-name01 hbase]$
>> HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
>> classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar
>> rowcounter packageindex
>>
>> Then I check the map/reduce task on job tracker URL
>>
>> On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna <vivekrishna@gmail.com
>> >wrote:
>>
>> > I guess it is using the mapred class
>> >
>> > 11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
>> > attempt_201103161245_0005_m_000004_0, Status : FAILED
>> > java.io.IOException: Cannot create a record reader because of a previous
>> > error. Please look at the previous logs lines from the task's full log
>> for
>> > more details.
>> >  at
>> >
>> >
>> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
>> > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>> >  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
>> > at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
>> >  at java.security.AccessController.doPrivileged(Native Method)
>> > at javax.security.auth.Subject.doAs(Subject.java:396)
>> >  at
>> >
>> >
>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
>> > at org.apache.hadoop.mapred.Child.main(Child.java:234)
>> >
>> > How do I use mapreduce class?
>> > Viv
>> >
>> >
>> >
>> > On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu <yu...@gmail.com> wrote:
>> >
>> > > Since we have lived so long without this information, I guess we can
>> hold
>> > > for longer :-)
>> > > Another issue I am working on is to reduce memory footprint. See the
>> > > following discussion thread:
>> > > One of the regionserver aborted, then the master shut down itself
>> > >
>> > > We have to bear in mind that there would be around 10K regions or more
>> in
>> > > production.
>> > >
>> > > Cheers
>> > >
>> > > On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting <je...@qualtrics.com>
>> > wrote:
>> > >
>> > > > Just a random thought.  What about keeping a per region row count?
>> >  Then
>> > > if
>> > > > you needed to get a row count for a table you'd just have to query
>> each
>> > > > region once and sum.  Seems like it wouldn't be too expensive
>> because
>> > > you'd
>> > > > just have a row counter variable.  It may be more complicated than
>> I'm
>> > > making
>> > > > it out to be though...
>> > > >
>> > > > ~Jeff
>> > > >
>> > > >
>> > > > On 3/16/2011 2:40 PM, Stack wrote:
>> > > >
>> > > >> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<
>> vivekrishna@gmail.com>
>> > > >>  wrote:
>> > > >>
>> > > >>> 1.  How do I count rows fast in hbase?
>> > > >>>
>> > > >>> First I tried count 'test', which takes ages.
>> > > >>>
>> > > >>> Saw that I could use RowCounter, but looks like it is deprecated.
>> > > >>>
>> > > >> It is not.  Make sure you are using the one from mapreduce package
>> as
>> > > >> opposed to mapred package.
>> > > >>
>> > > >>
>> > > >>  I just need to verify the total counts.  Is it possible to see
>> > > somewhere
>> > > >>> in
>> > > >>> the web interface or ganglia or by any other means?
>> > > >>>
>> > > >>  We don't keep a current count on a table.  Too expensive.  Run
>> the
>> > > >> rowcounter MR job.  This page may be of help:
>> > > >>
>> > > >>
>> > >
>> >
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
>> > > >>
>> > > >> Good luck,
>> > > >> St.Ack
>> > > >>
>> > > >
>> > > > --
>> > > > Jeff Whiting
>> > > > Qualtrics Senior Software Engineer
>> > > > jeffw@qualtrics.com
>> > > >
>> > > >
>> > >
>> >
>>
>
>

Re: Row Counters

Posted by Vivek Krishna <vi...@gmail.com>.
Oops. sorry about the environment.

I am using hadoop-0.20.2-CDH3B4, and hbase-0.90.1-CDH3B4
and zookeeper-3.3.2-CDH3B4.

I was able to configure jars and run the command,

hadoop jar /usr/lib/hbase/hbase-0.90.1-CDH3B4.jar rowcounter test,

but I get

java.io.IOException: Cannot create a record reader because of a
previous error. Please look at the previous logs lines from the task's
full log for more details.
	at org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
	at org.apache.hadoop.mapred.Child.main(Child.java:234)


The previous error in the task's full log is ..

2011-03-16 21:41:03,367 ERROR
org.apache.hadoop.hbase.mapreduce.TableInputFormat:
org.apache.hadoop.hbase.ZooKeeperConnectionException:
org.apache.hadoop.hbase.ZooKeeperConnectionException:
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:988)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:301)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:292)
	at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:155)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:167)
	at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:145)
	at org.apache.hadoop.hbase.mapreduce.TableInputFormat.setConf(TableInputFormat.java:91)
	at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:605)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
	at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:396)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
	at org.apache.hadoop.mapred.Child.main(Child.java:234)
Caused by: org.apache.hadoop.hbase.ZooKeeperConnectionException:
org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:147)
	at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getZooKeeperWatcher(HConnectionManager.java:986)
	... 15 more
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException:
KeeperErrorCode = ConnectionLoss for /hbase
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:90)
	at org.apache.zookeeper.KeeperException.create(KeeperException.java:42)
	at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:637)
	at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:902)
	at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:133)
	... 16 more


I am pretty sure the zookeeper master is running on the same machine at
port 2181.  Not sure why the connection loss occurs.  Do I need
HBASE-3578 <https://issues.apache.org/jira/browse/HBASE-3578> by any
chance?

Viv



On Wed, Mar 16, 2011 at 5:36 PM, Ted Yu <yu...@gmail.com> wrote:

> In the future, describe your environment a bit.
>
> The way I approach this is:
> find the correct commandline from
> src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java
>
> Then I issue:
> [hadoop@us01-ciqps1-name01 hbase]$
> HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
> classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar
> rowcounter packageindex
>
> Then I check the map/reduce task on job tracker URL
>
> On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna <vivekrishna@gmail.com
> >wrote:
>
> > I guess it is using the mapred class
> >
> > 11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
> > attempt_201103161245_0005_m_000004_0, Status : FAILED
> > java.io.IOException: Cannot create a record reader because of a previous
> > error. Please look at the previous logs lines from the task's full log
> for
> > more details.
> >  at
> >
> >
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
> > at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
> >  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
> > at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
> >  at java.security.AccessController.doPrivileged(Native Method)
> > at javax.security.auth.Subject.doAs(Subject.java:396)
> >  at
> >
> >
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> > at org.apache.hadoop.mapred.Child.main(Child.java:234)
> >
> > How do I use the mapreduce class?
> > Viv
> >
> >
> >
> > On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu <yu...@gmail.com> wrote:
> >
> > > Since we have lived so long without this information, I guess we can
> hold
> > > for longer :-)
> > > Another issue I am working on is to reduce memory footprint. See the
> > > following discussion thread:
> > > One of the regionserver aborted, then the master shut down itself
> > >
> > > We have to bear in mind that there would be around 10K regions or more
> in
> > > production.
> > >
> > > Cheers
> > >
> > > On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting <je...@qualtrics.com>
> > wrote:
> > >
> > > > Just a random thought.  What about keeping a per region row count?
> >  Then
> > > if
> > > > you needed to get a row count for a table you'd just have to query
> each
> > > > region once and sum.  Seems like it wouldn't be too expensive because
> > > you'd
> > > > just have a row counter variable.  It may be more complicated than I'm
> > > making
> > > > it out to be though...
> > > >
> > > > ~Jeff
> > > >
> > > >
> > > > On 3/16/2011 2:40 PM, Stack wrote:
> > > >
> > > >> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<
> vivekrishna@gmail.com>
> > > >>  wrote:
> > > >>
> > > >>> 1.  How do I count rows fast in hbase?
> > > >>>
> > > >>> First I tried count 'test', takes ages.
> > > >>>
> > > >>> Saw that I could use RowCounter, but looks like it is deprecated.
> > > >>>
> > > >> It is not.  Make sure you are using the one from mapreduce package
> as
> > > >> opposed to mapred package.
> > > >>
> > > >>
> > > >>  I just need to verify the total counts.  Is it possible to see
> > > somewhere
> > > >>> in
> > > >>> the web interface or ganglia or by any other means?
> > > >>>
> > > >>>  We don't keep a current count on a table.  Too expensive.  Run the
> > > >> rowcounter MR job.  This page may be of help:
> > > >>
> > > >>
> > >
> >
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
> > > >>
> > > >> Good luck,
> > > >> St.Ack
> > > >>
> > > >
> > > > --
> > > > Jeff Whiting
> > > > Qualtrics Senior Software Engineer
> > > > jeffw@qualtrics.com
> > > >
> > > >
> > >
> >
>

Re: Row Counters

Posted by Ted Yu <yu...@gmail.com>.
In the future, describe your environment a bit.

The way I approach this is:
find the correct commandline from
src/main/java/org/apache/hadoop/hbase/mapreduce/package-info.java

Then I issue:
[hadoop@us01-ciqps1-name01 hbase]$ HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase
classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.90.1.jar
rowcounter packageindex

Then I check the map/reduce tasks on the job tracker URL
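
If you want the total in code rather than off the job tracker page, a
minimal sketch along these lines should work (the counter group name below
is an assumption derived from the mapper's Counters enum in 0.90; verify it
against your version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.RowCounter;
import org.apache.hadoop.mapreduce.Job;

public class CountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Same job the bundled "rowcounter" driver submits.
    Job job = RowCounter.createSubmittableJob(conf,
        new String[] { "packageindex" });
    if (job.waitForCompletion(true)) {
      long rows = job.getCounters()
          .getGroup("org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters")
          .findCounter("ROWS").getValue();
      System.out.println("rows = " + rows);
    }
  }
}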

On Wed, Mar 16, 2011 at 1:59 PM, Vivek Krishna <vi...@gmail.com>wrote:

> I guess it is using the mapred class
>
> 11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
> attempt_201103161245_0005_m_000004_0, Status : FAILED
> java.io.IOException: Cannot create a record reader because of a previous
> error. Please look at the previous logs lines from the task's full log for
> more details.
>  at
>
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
> at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
>  at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:396)
>  at
>
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
> at org.apache.hadoop.mapred.Child.main(Child.java:234)
>
> How do I use the mapreduce class?
> Viv
>
>
>
> On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu <yu...@gmail.com> wrote:
>
> > Since we have lived so long without this information, I guess we can hold
> > for longer :-)
> > Another issue I am working on is to reduce memory footprint. See the
> > following discussion thread:
> > One of the regionserver aborted, then the master shut down itself
> >
> > We have to bear in mind that there would be around 10K regions or more in
> > production.
> >
> > Cheers
> >
> > On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting <je...@qualtrics.com>
> wrote:
> >
> > > Just a random thought.  What about keeping a per region row count?
>  Then
> > if
> > > you needed to get a row count for a table you'd just have to query each
> > > region once and sum.  Seems like it wouldn't be too expensive because
> > you'd
> > > just have a row counter variable.  It may be more complicated than I'm
> > making
> > > it out to be though...
> > >
> > > ~Jeff
> > >
> > >
> > > On 3/16/2011 2:40 PM, Stack wrote:
> > >
> > >> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<vi...@gmail.com>
> > >>  wrote:
> > >>
> > >>> 1.  How do I count rows fast in hbase?
> > >>>
> > >>> First I tried count 'test', takes ages.
> > >>>
> > >>> Saw that I could use RowCounter, but looks like it is deprecated.
> > >>>
> > >> It is not.  Make sure you are using the one from mapreduce package as
> > >> opposed to mapred package.
> > >>
> > >>
> > >>  I just need to verify the total counts.  Is it possible to see
> > somewhere
> > >>> in
> > >>> the web interface or ganglia or by any other means?
> > >>>
> > >>>  We don't keep a current count on a table.  Too expensive.  Run the
> > >> rowcounter MR job.  This page may be of help:
> > >>
> > >>
> >
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
> > >>
> > >> Good luck,
> > >> St.Ack
> > >>
> > >
> > > --
> > > Jeff Whiting
> > > Qualtrics Senior Software Engineer
> > > jeffw@qualtrics.com
> > >
> > >
> >
>

Re: Row Counters

Posted by Vivek Krishna <vi...@gmail.com>.
I guess it is using the mapred class

11/03/16 20:58:27 INFO mapred.JobClient: Task Id :
attempt_201103161245_0005_m_000004_0, Status : FAILED
java.io.IOException: Cannot create a record reader because of a previous
error. Please look at the previous logs lines from the task's full log for
more details.
 at
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
at org.apache.hadoop.mapred.Child$4.run(Child.java:240)
 at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
 at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115)
at org.apache.hadoop.mapred.Child.main(Child.java:234)

How do I use the mapreduce class?
Viv



On Wed, Mar 16, 2011 at 4:52 PM, Ted Yu <yu...@gmail.com> wrote:

> Since we have lived so long without this information, I guess we can hold
> for longer :-)
> Another issue I am working on is to reduce memory footprint. See the
> following discussion thread:
> One of the regionserver aborted, then the master shut down itself
>
> We have to bear in mind that there would be around 10K regions or more in
> production.
>
> Cheers
>
> On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting <je...@qualtrics.com> wrote:
>
> > Just a random thought.  What about keeping a per region row count?  Then
> if
> > you needed to get a row count for a table you'd just have to query each
> > region once and sum.  Seems like it wouldn't be too expensive because
> you'd
> > just have a row counter variable.  It may be more complicated than I'm
> making
> > it out to be though...
> >
> > ~Jeff
> >
> >
> > On 3/16/2011 2:40 PM, Stack wrote:
> >
> >> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<vi...@gmail.com>
> >>  wrote:
> >>
> >>> 1.  How do I count rows fast in hbase?
> >>>
> >>> First I tried count 'test', takes ages.
> >>>
> >>> Saw that I could use RowCounter, but looks like it is deprecated.
> >>>
> >> It is not.  Make sure you are using the one from mapreduce package as
> >> opposed to mapred package.
> >>
> >>
> >>  I just need to verify the total counts.  Is it possible to see
> somewhere
> >>> in
> >>> the web interface or ganglia or by any other means?
> >>>
> >>>  We don't keep a current count on a table.  Too expensive.  Run the
> >> rowcounter MR job.  This page may be of help:
> >>
> >>
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
> >>
> >> Good luck,
> >> St.Ack
> >>
> >
> > --
> > Jeff Whiting
> > Qualtrics Senior Software Engineer
> > jeffw@qualtrics.com
> >
> >
>

Re: Row Counters

Posted by Ted Yu <yu...@gmail.com>.
Since we have lived so long without this information, I guess we can hold
for longer :-)
Another issue I am working on is to reduce memory footprint. See the
following discussion thread:
One of the regionserver aborted, then the master shut down itself

We have to bear in mind that there would be around 10K regions or more in
production.

Cheers

On Wed, Mar 16, 2011 at 1:46 PM, Jeff Whiting <je...@qualtrics.com> wrote:

> Just a random thought.  What about keeping a per region row count?  Then if
> you needed to get a row count for a table you'd just have to query each
> region once and sum.  Seems like it wouldn't be too expensive because you'd
> just have a row counter variable.  It may be more complicated than I'm making
> it out to be though...
>
> ~Jeff
>
>
> On 3/16/2011 2:40 PM, Stack wrote:
>
>> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<vi...@gmail.com>
>>  wrote:
>>
>>> 1.  How do I count rows fast in hbase?
>>>
>>> First I tried count 'test', takes ages.
>>>
>>> Saw that I could use RowCounter, but looks like it is deprecated.
>>>
>> It is not.  Make sure you are using the one from mapreduce package as
>> opposed to mapred package.
>>
>>
>>  I just need to verify the total counts.  Is it possible to see somewhere
>>> in
>>> the web interface or ganglia or by any other means?
>>>
>>>  We don't keep a current count on a table.  Too expensive.  Run the
>> rowcounter MR job.  This page may be of help:
>>
>> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
>>
>> Good luck,
>> St.Ack
>>
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> jeffw@qualtrics.com
>
>

RE: Row Counters

Posted by Peter Haidinyak <ph...@local.com>.
When I needed to know the row count for a table, I kept a separate table just for that purpose and would update/query it.  Low tech, but it worked.
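
A minimal sketch of that approach (the table_counts table and its column
names here are made up for illustration; the count only stays accurate if
every writer goes through this increment path, and deletes decrement it):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TableCounts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable counts = new HTable(conf, "table_counts");

    // After each successful insert into 'test', bump its counter atomically.
    counts.incrementColumnValue(Bytes.toBytes("test"),
        Bytes.toBytes("c"), Bytes.toBytes("rows"), 1L);

    // Reading the total back is a single Get.
    Result r = counts.get(new Get(Bytes.toBytes("test")));
    long total = Bytes.toLong(r.getValue(Bytes.toBytes("c"),
        Bytes.toBytes("rows")));
    System.out.println("rows in 'test' = " + total);
  }
}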

-Pete

-----Original Message-----
From: Jeff Whiting [mailto:jeffw@qualtrics.com] 
Sent: Wednesday, March 16, 2011 1:46 PM
To: user@hbase.apache.org
Cc: Stack
Subject: Re: Row Counters

Just a random thought.  What about keeping a per region row count?  Then if you needed to get a row 
count for a table you'd just have to query each region once and sum.  Seems like it wouldn't be too 
expensive because you'd just have a row counter variable.  It may be more complicated than I'm making 
it out to be though...

~Jeff

On 3/16/2011 2:40 PM, Stack wrote:
> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<vi...@gmail.com>  wrote:
>> 1.  How do I count rows fast in hbase?
>>
>> First I tried count 'test', takes ages.
>>
>> Saw that I could use RowCounter, but looks like it is deprecated.
> It is not.  Make sure you are using the one from mapreduce package as
> opposed to mapred package.
>
>
>> I just need to verify the total counts.  Is it possible to see somewhere in
>> the web interface or ganglia or by any other means?
>>
> We don't keep a current count on a table.  Too expensive.  Run the
> rowcounter MR job.  This page may be of help:
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
>
> Good luck,
> St.Ack

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com


Re: Row Counters

Posted by Jeff Whiting <je...@qualtrics.com>.
Just a random thought.  What about keeping a per region row count?  Then if you needed to get a row 
count for a table you'd just have to query each region once and sum.  Seems like it wouldn't be too 
expensive because you'd just have a row counter variable.  It may be more complicated than I'm making 
it out to be though...
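
Purely to illustrate the idea (nothing like this exists in HBase today;
getRegionRowCount below stands in for a region server RPC that would have
to be added):

import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionInfo;
import org.apache.hadoop.hbase.HServerAddress;
import org.apache.hadoop.hbase.client.HTable;

public class PerRegionSum {
  // Hypothetical RPC; not part of any HBase release.
  static long getRegionRowCount(HRegionInfo region, HServerAddress server) {
    throw new UnsupportedOperationException("would need a new RPC");
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test");
    long total = 0;
    // One cheap call per region instead of scanning every row.
    for (Map.Entry<HRegionInfo, HServerAddress> e
        : table.getRegionsInfo().entrySet()) {
      total += getRegionRowCount(e.getKey(), e.getValue());
    }
    System.out.println("total = " + total);
  }
}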

~Jeff

On 3/16/2011 2:40 PM, Stack wrote:
> On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna<vi...@gmail.com>  wrote:
>> 1.  How do I count rows fast in hbase?
>>
>> First I tried count 'test', takes ages.
>>
>> Saw that I could use RowCounter, but looks like it is deprecated.
> It is not.  Make sure you are using the one from mapreduce package as
> opposed to mapred package.
>
>
>> I just need to verify the total counts.  Is it possible to see somewhere in
>> the web interface or ganglia or by any other means?
>>
> We don't keep a current count on a table.  Too expensive.  Run the
> rowcounter MR job.  This page may be of help:
> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
>
> Good luck,
> St.Ack

-- 
Jeff Whiting
Qualtrics Senior Software Engineer
jeffw@qualtrics.com


Re: Row Counters

Posted by Stack <st...@duboce.net>.
On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna <vi...@gmail.com> wrote:
> 1.  How do I count rows fast in hbase?
>
> First I tried count 'test', takes ages.
>
> Saw that I could use RowCounter, but looks like it is deprecated.

It is not.  Make sure you are using the one from mapreduce package as
opposed to mapred package.
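
The deprecated one is org.apache.hadoop.hbase.mapred.RowCounter; the
current one is org.apache.hadoop.hbase.mapreduce.RowCounter, and the
bundled "rowcounter" driver already points at the latter.  If you are
wiring your own scan job, a minimal sketch against the current API (the
mapper, job name, and counter here are placeholders, not the stock
RowCounter):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class MyRowCount {
  static class MyMapper extends TableMapper<ImmutableBytesWritable, Result> {
    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context ctx) {
      ctx.getCounter("mycount", "ROWS").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "myrowcount");
    job.setJarByClass(MyRowCount.class);
    Scan scan = new Scan();
    scan.setCaching(500);        // fewer RPCs per map task
    scan.setCacheBlocks(false);  // don't churn the block cache on a full scan
    // Note the package: ...hbase.mapreduce, not ...hbase.mapred.
    TableMapReduceUtil.initTableMapperJob("test", scan, MyMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.setNumReduceTasks(0);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}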


> I just need to verify the total counts.  Is it possible to see somewhere in
> the web interface or ganglia or by any other means?
>

We don't keep a current count on a table.  Too expensive.  Run the
rowcounter MR job.  This page may be of help:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description
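
As for the slow shell count: it is still a full scan, but raising the
scanner cache cuts the round trips considerably (option per the 0.90 shell
help):

hbase> count 'test', CACHE => 10000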

Good luck,
St.Ack

Re: Row Counters

Posted by Ted Yu <yu...@gmail.com>.
$ ./bin/hadoop jar hbase*.jar rowcounter
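
For example, against the 'test' table from this message:

$ ./bin/hadoop jar hbase*.jar rowcounter test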

Search for related discussion on search-hadoop.

On Wed, Mar 16, 2011 at 1:35 PM, Vivek Krishna <vi...@gmail.com>wrote:

> 1.  How do I count rows fast in hbase?
>
> First I tried count 'test', takes ages.
>
> Saw that I could use RowCounter, but looks like it is deprecated.  When I
> try to use it, I get
>
> java.io.IOException: Cannot create a record reader because of a previous
> error. Please look at the previous logs lines from the task's full log for
> more details.
> at
>
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.createRecordReader(TableInputFormatBase.java:98)
>
> If this is deprecated, is there any other way of finding the counts?
>
> I just need to verify the total counts.  Is it possible to see somewhere in
> the web interface or ganglia or by any other means?
>
> Viv
>