Posted to user@cassandra.apache.org by Sheng Chen <ch...@gmail.com> on 2011/03/28 08:40:06 UTC

newbie question: how do I know the total number of rows of a cf?

Hi all,
I want to know how many records I am holding in Cassandra, just like
count(*) in SQL.
What can I do? Thank you.

Sheng

Re: newbie question: how do I know the total number of rows of a cf?

Posted by aaron morton <aa...@thelastpickle.com>.
It iterates over all the SSTables on disk and estimates the number of keys by looking at how big the index is. It does not count the actual keys. 
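
If you want to pull that estimate programmatically, here is a minimal JMX
sketch. The host/port and the keyspace/CF names are placeholders, and it
assumes the per-CF bean is registered under
org.apache.cassandra.db:type=ColumnFamilies (the same beans nodetool reads):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class EstimateKeys {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; 7199 is the usual JMX port on recent
            // releases (older ones shipped with 8080).
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                ObjectName cf = new ObjectName(
                        "org.apache.cassandra.db:type=ColumnFamilies,"
                        + "keyspace=MyKeyspace,columnfamily=MyCF");
                // estimateKeys() is an operation, not an attribute, so it is
                // invoked; the result is per node and is only an estimate.
                Object estimate = mbs.invoke(cf, "estimateKeys",
                        new Object[0], new String[0]);
                System.out.println("Estimated keys on this node: " + estimate);
            } finally {
                jmxc.close();
            }
        }
    }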

aaron


On 31 Mar 2011, at 17:46, Sheng Chen wrote:

> I just found an estimateKeys() method of the ColumnFamilyStoreMBean.
> Is there any indication of how it works?
> 
> Sheng
> 
> 2011/3/28 Sheng Chen <ch...@gmail.com>
> Hi all,
> I want to know how many records I am holding in Cassandra, just like count(*) in SQL.
> What can I do? Thank you.
> 
> Sheng
> 
> 
> 


Re: What sort of load do the tombstones create on the cluster?

Posted by Mohit Anchlia <mo...@gmail.com>.
On Mon, Nov 21, 2011 at 11:47 AM, Edward Capriolo <ed...@gmail.com> wrote:
>
>
> On Mon, Nov 21, 2011 at 3:30 AM, Philippe <wa...@gmail.com> wrote:
>>
>> I don't remember your exact situation, but could it be your network
>> connectivity?
>> I know I've been upgrading mine because I'm maxing out Fast Ethernet on a
>> 12-node cluster.
>>
>> On 20 Nov 2011 at 22:54, "Jahangir Mohammed" <md...@gmail.com> wrote:
>>>
>>> Mostly, they are I/O and CPU intensive during major compaction. If
>>> Ganglia doesn't have anything suspicious there, then what is the
>>> performance loss? Read or write?
>>>
>>> On Nov 17, 2011 1:01 PM, "Maxim Potekhin" <po...@bnl.gov> wrote:
>>>>
>>>> In view of my unpleasant discovery last week that deletions in Cassandra
>>>> lead to a very real
>>>> and serious performance loss, I'm working on a strategy of moving
>>>> forward.
>>>>
>>>> If the tombstones do cause such a problem, where should I be looking for
>>>> performance bottlenecks?
>>>> Is it disk, CPU or something else? Thing is, I don't see anything
>>>> outstanding in my Ganglia plots.
>>>>
>>>> TIA,
>>>>
>>>> Maxim
>>>>
>
> Tombstones do have a performance impact, particularly in cases where there
> is a lot of data turnover and you are using the standard (non-LevelDB)
> compaction. Tombstones live on disk for gc_grace_seconds. First, the
> tombstone takes up some small amount of space, which has an effect on disk
> caching. Secondly, having a tombstone has an effect on the read path, as a
> read for a row key will now match multiple bloom filters.
> If you are constantly adding and removing data and you have a long
> gc_grace_seconds (10 days is pretty long if your dataset is new every day,
> for example), this is more pronounced than in a use case that rarely
> deletes. This is why you will notice some use cases call for 'major
> compaction' while other people believe you should never need it.
> I force majors on some column families because there is high turnover, the
> data needs to be read often, and the difference in data size is the
> difference between a 20GB size on disk that fits in VFS cache and a 35GB
> size on disk that doesn't (and that also may 'randomly' have a large
> compaction at peak time).
> I am pretty excited about LevelDB because the leveled compaction looks
> to be more space efficient.

Have you had a chance to benchmark the LevelDB compaction?

>
>

Re: What sort of load do the tombstones create on the cluster?

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Nov 21, 2011 at 3:30 AM, Philippe <wa...@gmail.com> wrote:

> I don't remember your exact situation, but could it be your network
> connectivity?
> I know I've been upgrading mine because I'm maxing out Fast Ethernet on a
> 12-node cluster.
> On 20 Nov 2011 at 22:54, "Jahangir Mohammed" <md...@gmail.com> wrote:
>
>> Mostly, they are I/O and CPU intensive during major compaction. If
>> Ganglia doesn't have anything suspicious there, then what is the
>> performance loss? Read or write?
>> On Nov 17, 2011 1:01 PM, "Maxim Potekhin" <po...@bnl.gov> wrote:
>>
>>> In view of my unpleasant discovery last week that deletions in Cassandra
>>> lead to a very real
>>> and serious performance loss, I'm working on a strategy of moving
>>> forward.
>>>
>>> If the tombstones do cause such a problem, where should I be looking for
>>> performance bottlenecks?
>>> Is it disk, CPU or something else? Thing is, I don't see anything
>>> outstanding in my Ganglia plots.
>>>
>>> TIA,
>>>
>>> Maxim
>>>
>>>
Tombstones do have a performance impact, particularly in cases where there
is a lot of data turnover and you are using the standard (non-LevelDB)
compaction. Tombstones live on disk for gc_grace_seconds. First, the
tombstone takes up some small amount of space, which has an effect on disk
caching. Secondly, having a tombstone has an effect on the read path, as a
read for a row key will now match multiple bloom filters.

If you are constantly adding and removing data and you have a long
gc_grace_seconds (10 days is pretty long if your dataset is new every day,
for example), this is more pronounced than in a use case that rarely
deletes. This is why you will notice some use cases call for 'major
compaction' while other people believe you should never need it.

I force majors on some column families because there is high turnover, the
data needs to be read often, and the difference in data size is the
difference between a 20GB size on disk that fits in VFS cache and a 35GB
size on disk that doesn't (and that also may 'randomly' have a large
compaction at peak time).

I am pretty excited about LevelDB because the leveled compaction looks
to be more space efficient.
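
For the record, forcing a major from code is just a JMX operation on the
same per-CF bean ("nodetool compact <keyspace> <cf>" does the equivalent
from the shell). A minimal sketch; host/port and the keyspace/CF names are
placeholders, and it assumes the forceMajorCompaction() operation on the
ColumnFamilyStore bean:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class ForceMajor {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();
                ObjectName cf = new ObjectName(
                        "org.apache.cassandra.db:type=ColumnFamilies,"
                        + "keyspace=MyKeyspace,columnfamily=MyCF");
                // Merges all SSTables for the CF into one, purging
                // tombstones older than gc_grace_seconds along the way.
                mbs.invoke(cf, "forceMajorCompaction",
                        new Object[0], new String[0]);
            } finally {
                jmxc.close();
            }
        }
    }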

Re: What sort of load do the tombstones create on the cluster?

Posted by Philippe <wa...@gmail.com>.
I don't remember your exact situation, but could it be your network
connectivity?
I know I've been upgrading mine because I'm maxing out Fast Ethernet on a
12-node cluster.
On 20 Nov 2011 at 22:54, "Jahangir Mohammed" <md...@gmail.com> wrote:

> Mostly, they are I/O and CPU intensive during major compaction. If Ganglia
> doesn't have anything suspicious there, then what is the performance loss?
> Read or write?
> On Nov 17, 2011 1:01 PM, "Maxim Potekhin" <po...@bnl.gov> wrote:
>
>> In view of my unpleasant discovery last week that deletions in Cassandra
>> lead to a very real
>> and serious performance loss, I'm working on a strategy of moving forward.
>>
>> If the tombstones do cause such a problem, where should I be looking for
>> performance bottlenecks?
>> Is it disk, CPU or something else? Thing is, I don't see anything
>> outstanding in my Ganglia plots.
>>
>> TIA,
>>
>> Maxim
>>
>>

Re: What sort of load do the tombstones create on the cluster?

Posted by Jahangir Mohammed <md...@gmail.com>.
Mostly, they are I/O and CPU intensive during major compaction. If Ganglia
doesn't have anything suspicious there, then what is the performance loss?
Read or write?
On Nov 17, 2011 1:01 PM, "Maxim Potekhin" <po...@bnl.gov> wrote:

> In view of my unpleasant discovery last week that deletions in Cassandra
> lead to a very real
> and serious performance loss, I'm working on a strategy of moving forward.
>
> If the tombstones do cause such a problem, where should I be looking for
> performance bottlenecks?
> Is it disk, CPU or something else? Thing is, I don't see anything
> outstanding in my Ganglia plots.
>
> TIA,
>
> Maxim
>
>

Re: What sort of load do the tombstones create on the cluster?

Posted by Aaron Turner <sy...@gmail.com>.
What do you mean by "performance loss"?  For example, are you seeing it on
the read or write side?  During compactions?  Deletions themselves
shouldn't be expensive, but if you have a lot of tombstones that
haven't been compacted away, that will make reads slower since there is
more data to scan.  One thing to try is kicking off major compactions
more often so they're smaller (less load) and clean out the deleted
data more often.

You should be able to tell if it is disk or CPU pretty easily via the
JMX interface (jconsole or OpsCenter can read those values) or
something like iostat.  Basically look for high disk I/O wait... if you
see that, it's disk.  If not, it's CPU.

One optimization I'm doing in my application is choosing row keys so
that I can delete an entire row at a time rather than individual
columns, so there is only one tombstone for the whole row.  This isn't
always possible, but if you can lay out your data in a way that makes
this possible, it's a good optimization.
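
To illustrate, a sketch of the whole-row layout with the Hector client.
The cluster, keyspace, CF, and key format below are placeholder
assumptions, and it relies on Hector's row-level addDeletion(key, cf)
overload:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class RowDelete {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster",
                    "localhost:9160");
            Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster);
            Mutator<String> m = HFactory.createMutator(ks,
                    StringSerializer.get());
            // With the date baked into the row key, expiring a day of data
            // is a single row-level deletion (one tombstone) instead of a
            // tombstone per column.
            m.addDeletion("events-2011-11-17", "MyCF");
            m.execute();
        }
    }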



On Thu, Nov 17, 2011 at 10:01 AM, Maxim Potekhin <po...@bnl.gov> wrote:
> In view of my unpleasant discovery last week that deletions in Cassandra
> lead to a very real
> and serious performance loss, I'm working on a strategy of moving forward.
>
> If the tombstones do cause such a problem, where should I be looking for
> performance bottlenecks?
> Is it disk, CPU or something else? Thing is, I don't see anything
> outstanding in my Ganglia plots.
>
> TIA,
>
> Maxim
>
>



-- 
Aaron Turner
http://synfin.net/         Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
    -- Benjamin Franklin
"carpe diem quam minimum credula postero"

What sort of load do the tombstones create on the cluster?

Posted by Maxim Potekhin <po...@bnl.gov>.
In view of my unpleasant discovery last week that deletions in Cassandra 
lead to a very real
and serious performance loss, I'm working on a strategy of moving forward.

If the tombstones do cause such a problem, where should I be looking for 
performance bottlenecks?
Is it disk, CPU or something else? Thing is, I don't see anything 
outstanding in my Ganglia plots.

TIA,

Maxim


Re: newbie question: how do I know the total number of rows of a cf?

Posted by Sheng Chen <ch...@gmail.com>.
I just found an estimateKeys() method of the ColumnFamilyStoreMBean.
Is there any indication of how it works?

Sheng

2011/3/28 Sheng Chen <ch...@gmail.com>

> Hi all,
> I want to know how many records I am holding in Cassandra, just like
> count(*) in SQL.
> What can I do? Thank you.
>
> Sheng
>
>
>

Re: newbie question: how do I know the total number of rows of a cf?

Posted by Stephen Connolly <st...@gmail.com>.
ok, so not all nosql has column families...

just

s/nosql/cassandra/g

on my previous post ;-)

On 28 March 2011 13:38, Joshua Partogi <jo...@gmail.com> wrote:
> Not all NoSQL is like that. Or perhaps the term NoSQL has become vague
> these days.
>
> On Mon, Mar 28, 2011 at 6:16 PM, Stephen Connolly
> <st...@gmail.com> wrote:
>> iterate.
>>
>> otherwise if that will be too slow and you will do it often, the nosql way
>> is to create a separate column family updated with each row add/delete to
>> hold the answer for you.
>>
>> - Stephen
>>
>> ---
>> Sent from my Android phone, so random spelling mistakes, random nonsense
>> words and other nonsense are a direct result of using swype to type on the
>> screen
>>
>> On 28 Mar 2011 07:40, "Sheng Chen" <ch...@gmail.com> wrote:
>>> Hi all,
>>> I want to know how many records I am holding in Cassandra, just like
>>> count(*) in SQL.
>>> What can I do? Thank you.
>>>
>>> Sheng
>>
>
>
>
> --
> http://twitter.com/jpartogi
>

Re: newbie question: how do I know the total number of rows of a cf?

Posted by Sheng Chen <ch...@gmail.com>.
Thanks all.

2011/3/28 Stephen Connolly <st...@gmail.com>

> for #2 you could pipe through wc -l to get the answer
>
> sort -n keys.txt | uniq | wc -l
>
> but both examples are just refinements of iterate.
>
> #1 is just a distributed iterate
> #2 is just an optimized iterate based on knowledge of the on-disk
> format (and may give inaccurate results... tombstones...)
>
> On 28 March 2011 14:16, Or Yanay <or...@peer39.com> wrote:
> > I use one of two ways to achieve that:
> >  1. Run a MapReduce job. Pig is really helpful in these cases. Make sure
> > you run your MR using a Hadoop task tracker on your nodes - or your
> > performance will take a hit.
> >  2. Dump all keys using the sstablekeys script from the relevant files on
> > all machines and count unique values. I do that using
> > "sort -n keys.txt | uniq >> unique_keys.txt"
> >
> > Dumping all keys is much faster but less elegant, and can be more annoying
> > if you want to do that from your application.
> >
> > Hope that does the trick for you.
> > -Orr
> >
> > -----Original Message-----
> > From: Joshua Partogi [mailto:joshua.java@gmail.com]
> > Sent: Monday, March 28, 2011 2:39 PM
> > To: user@cassandra.apache.org
> > Subject: Re: newbie question: how do I know the total number of rows of a
> cf?
> >
> > Not all NoSQL is like that. Or perhaps the term NoSQL has become vague
> > these days.
> >
> > On Mon, Mar 28, 2011 at 6:16 PM, Stephen Connolly
> > <st...@gmail.com> wrote:
> >> iterate.
> >>
> >> otherwise if that will be too slow and you will do it often, the nosql
> way
> >> is to create a separate column family updated with each row add/delete
> to
> >> hold the answer for you.
> >>
> >> - Stephen
> >>
> >> ---
> >> Sent from my Android phone, so random spelling mistakes, random nonsense
> >> words and other nonsense are a direct result of using swype to type on
> the
> >> screen
> >>
> >> On 28 Mar 2011 07:40, "Sheng Chen" <ch...@gmail.com> wrote:
> >>> Hi all,
> >>> I want to know how many records I am holding in Cassandra, just like
> >>> count(*) in SQL.
> >>> What can I do? Thank you.
> >>>
> >>> Sheng
> >>
> >
> >
> >
> > --
> > http://twitter.com/jpartogi
> >
>

Re: newbie question: how do I know the total number of rows of a cf?

Posted by Stephen Connolly <st...@gmail.com>.
for #2 you could pipe through wc -l to get the answer

sort -n keys.txt | uniq | wc -l

but both examples are just refinements of iterate.

#1 is just a distributed iterate
#2 is just an optimized iterate based on knowledge of the on-disk
format (and may give inaccurate results... tombstones...)

On 28 March 2011 14:16, Or Yanay <or...@peer39.com> wrote:
> I use one of two ways to achieve that:
>  1. Run a MapReduce job. Pig is really helpful in these cases. Make sure you run your MR using a Hadoop task tracker on your nodes - or your performance will take a hit.
>  2. Dump all keys using the sstablekeys script from the relevant files on all machines and count unique values. I do that using "sort -n keys.txt | uniq >> unique_keys.txt"
>
> Dumping all keys is much faster but less elegant, and can be more annoying if you want to do that from your application.
>
> Hope that does the trick for you.
> -Orr
>
> -----Original Message-----
> From: Joshua Partogi [mailto:joshua.java@gmail.com]
> Sent: Monday, March 28, 2011 2:39 PM
> To: user@cassandra.apache.org
> Subject: Re: newbie question: how do I know the total number of rows of a cf?
>
> Not all NoSQL is like that. Or perhaps the term NoSQL has become vague
> these days.
>
> On Mon, Mar 28, 2011 at 6:16 PM, Stephen Connolly
> <st...@gmail.com> wrote:
>> iterate.
>>
>> otherwise if that will be too slow and you will do it often, the nosql way
>> is to create a separate column family updated with each row add/delete to
>> hold the answer for you.
>>
>> - Stephen
>>
>> ---
>> Sent from my Android phone, so random spelling mistakes, random nonsense
>> words and other nonsense are a direct result of using swype to type on the
>> screen
>>
>> On 28 Mar 2011 07:40, "Sheng Chen" <ch...@gmail.com> wrote:
>>> Hi all,
>>> I want to know how many records I am holding in Cassandra, just like
> >>> count(*) in SQL.
> >>> What can I do? Thank you.
>>>
>>> Sheng
>>
>
>
>
> --
> http://twitter.com/jpartogi
>

RE: newbie question: how do I know the total number of rows of a cf?

Posted by Or Yanay <or...@peer39.com>.
I use one of two ways to achieve that:
  1. Run a MapReduce job. Pig is really helpful in these cases. Make sure you run your MR using a Hadoop task tracker on your nodes - or your performance will take a hit.
  2. Dump all keys using the sstablekeys script from the relevant files on all machines and count unique values. I do that using "sort -n keys.txt | uniq >> unique_keys.txt"

Dumping all keys is much faster but less elegant, and can be more annoying if you want to do that from your application.

Hope that does the trick for you.
-Orr

-----Original Message-----
From: Joshua Partogi [mailto:joshua.java@gmail.com] 
Sent: Monday, March 28, 2011 2:39 PM
To: user@cassandra.apache.org
Subject: Re: newbie question: how do I know the total number of rows of a cf?

Not all NoSQL is like that. Or perhaps the term NoSQL has become vague
these days.

On Mon, Mar 28, 2011 at 6:16 PM, Stephen Connolly
<st...@gmail.com> wrote:
> iterate.
>
> otherwise if that will be too slow and you will do it often, the nosql way
> is to create a separate column family updated with each row add/delete to
> hold the answer for you.
>
> - Stephen
>
> ---
> Sent from my Android phone, so random spelling mistakes, random nonsense
> words and other nonsense are a direct result of using swype to type on the
> screen
>
> On 28 Mar 2011 07:40, "Sheng Chen" <ch...@gmail.com> wrote:
>> Hi all,
>> I want to know how many records I am holding in Cassandra, just like
>> count(*) in SQL.
>> What can I do? Thank you.
>>
>> Sheng
>



-- 
http://twitter.com/jpartogi

Re: newbie question: how do I know the total number of rows of a cf?

Posted by Joshua Partogi <jo...@gmail.com>.
Not all NoSQL is like that. Or perhaps the term NoSQL has become vague
these days.

On Mon, Mar 28, 2011 at 6:16 PM, Stephen Connolly
<st...@gmail.com> wrote:
> iterate.
>
> otherwise if that will be too slow and you will do it often, the nosql way
> is to create a separate column family updated with each row add/delete to
> hold the answer for you.
>
> - Stephen
>
> ---
> Sent from my Android phone, so random spelling mistakes, random nonsense
> words and other nonsense are a direct result of using swype to type on the
> screen
>
> On 28 Mar 2011 07:40, "Sheng Chen" <ch...@gmail.com> wrote:
>> Hi all,
>> I want to know how many records I am holding in Cassandra, just like
>> count(*) in SQL.
>> What can I do? Thank you.
>>
>> Sheng
>



-- 
http://twitter.com/jpartogi

Re: newbie question: how do I know the total number of rows of a cf?

Posted by Stephen Connolly <st...@gmail.com>.
iterate.
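
a minimal sketch of the iterate approach with the Hector client, assuming
RandomPartitioner and placeholder cluster/keyspace/CF names: page through
range slices and count the keys.

    import java.util.List;

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Cluster;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.beans.OrderedRows;
    import me.prettyprint.hector.api.beans.Row;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.query.RangeSlicesQuery;

    public class CountRows {
        public static void main(String[] args) {
            Cluster cluster = HFactory.getOrCreateCluster("TestCluster",
                    "localhost:9160");
            Keyspace ks = HFactory.createKeyspace("MyKeyspace", cluster);
            StringSerializer ss = StringSerializer.get();
            int pageSize = 1000;
            long count = 0;
            String start = "";
            while (true) {
                RangeSlicesQuery<String, String, String> q =
                        HFactory.createRangeSlicesQuery(ks, ss, ss, ss)
                                .setColumnFamily("MyCF")
                                .setKeys(start, "")
                                .setRowCount(pageSize)
                                .setReturnKeysOnly();
                OrderedRows<String, String, String> rows = q.execute().get();
                List<Row<String, String, String>> page = rows.getList();
                for (Row<String, String, String> row : page) {
                    // the last key of one page is the first of the next
                    if (!row.getKey().equals(start)) {
                        count++;
                    }
                }
                if (page.size() < pageSize) {
                    break;
                }
                start = rows.peekLast().getKey();
            }
            // note: range scans also return rows that are nothing but
            // tombstones, so this can overshoot until compaction drops them
            System.out.println("rows in MyCF: " + count);
        }
    }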

otherwise if that will be too slow and you will do it often, the nosql way
is to create a separate column family updated with each row add/delete to
hold the answer for you.
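
one way to implement that suggestion - assuming the distributed counters
that landed in 0.8 (after this thread) and the Hector client; all names are
placeholders. note the counter update is a separate write from the data
mutation, so the two can drift if one of them fails:

    import me.prettyprint.cassandra.serializers.StringSerializer;
    import me.prettyprint.hector.api.Keyspace;
    import me.prettyprint.hector.api.factory.HFactory;
    import me.prettyprint.hector.api.mutation.Mutator;

    public class RowCountTracker {
        private final Keyspace ks;

        public RowCountTracker(Keyspace ks) {
            this.ks = ks;
        }

        // call alongside every row insert
        public void rowAdded(String cfName) {
            Mutator<String> m = HFactory.createMutator(ks,
                    StringSerializer.get());
            // row key = name of the CF being counted; "Counters" is assumed
            // to be a counter column family
            m.incrementCounter(cfName, "Counters", "rows", 1L);
            m.execute();
        }

        // call alongside every row delete
        public void rowDeleted(String cfName) {
            Mutator<String> m = HFactory.createMutator(ks,
                    StringSerializer.get());
            m.decrementCounter(cfName, "Counters", "rows", 1L);
            m.execute();
        }
    }

reading the total back is then a single counter column read instead of a
full scan.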

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random nonsense
words and other nonsense are a direct result of using swype to type on the
screen
On 28 Mar 2011 07:40, "Sheng Chen" <ch...@gmail.com> wrote:
> Hi all,
> I want to know how many records I am holding in Cassandra, just like
> count(*) in SQL.
> What can I do? Thank you.
>
> Sheng