Posted to user@cassandra.apache.org by Bartłomiej Romański <br...@sentia.pl> on 2012/09/18 03:46:39 UTC

are counters stable enough for production?

Hi,

Does anyone have any experience with using Cassandra counters in production?

We rely heavily on them and recently we've had a few very serious
problems. Our counter values suddenly became a few times higher than
expected. From the business point of view this is a disaster :/ Also
there are a few open major bugs related to them, some of them open for
quite a long time (months).

We are seriously considering going back to other solutions (e.g. SQL
databases). We simply cannot afford incorrect counter values. We can
tolerate losing a few increments from time to time, but we cannot
tolerate having counters suddenly 3 times higher or lower than the
expected values.

What is the current status of counters? Should I consider them a
production-ready feature and assume we just had some bad luck? Or should I
rather consider them an experimental feature and look for some other
solutions?

Do you have any experience with them? Any comments would be very
helpful for us!

Thanks,
Bartek

Re: are counters stable enough for production?

Posted by Robin Verlangen <ro...@us2.nl>.
" To go further, would it maybe be an idea to count everything twice? One
as postive value and once as negative value. When reading the counters, the
application could just compare the negative and positive counter to get an
error margin. "

This sounds interesting. Maybe someone should implement this, maybe even
backed by a MySQL mirror, and compare the results after a week of full
load.
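
Roughly something like this at the application level (just a sketch in
Python with a toy in-memory client; increment() and read() are hypothetical
stand-ins for whatever phpcassa/CQL calls you already use):

    # Count every event twice: once as +delta and once as -delta, in two
    # independent counter columns. At read time the two should cancel out;
    # any difference is the error margin.

    class FakeClient:
        # Toy in-memory stand-in, only here to make the sketch runnable.
        def __init__(self):
            self.data = {}

        def increment(self, key, col, delta):
            self.data[(key, col)] = self.data.get((key, col), 0) + delta

        def read(self, key, col):
            return self.data.get((key, col), 0)

    def record_event(client, key, delta=1):
        client.increment(key, "count_pos", delta)
        client.increment(key, "count_neg", -delta)

    def read_with_margin(client, key):
        pos = client.read(key, "count_pos")    # expected +N
        neg = client.read(key, "count_neg")    # expected -N
        return pos, abs(pos + neg)             # (value, error margin)

    client = FakeClient()
    for _ in range(100):
        record_event(client, "page:home")
    print(read_with_margin(client, "page:home"))   # (100, 0) if nothing went wrong

The MySQL mirror would then just be one more write per event, plus a
periodic comparison job.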

" I assume most people would rather have an under- than an overcount. "

This is probably 100% dependent on your use case. If you're counting errors
in an application that you really care about, you would prefer an overcount
over an undercount. If you're counting profit, you would not want to measure
more than you actually made. However, in all cases you'll want the right
results: spot on, without loss of precision.

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


Re: are counters stable enough for production?

Posted by horschi <ho...@gmail.com>.
"The repair of taking the highest value of two inconsistent might cause
getting higher values?"


If a counter counts backwards (and therefore has negative values), would
repair still choose the larger value? Or does cassandra take the higher
absolute value? That would result in an undercount in case of an error
instead of an overcount.
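
(As a concrete example: if one replica holds -120 for a decrement-only
counter and the other holds -80, and -120 is the correct value, then taking
the plain signed maximum keeps -80, so the counter reads less negative than
it should, an undercount in magnitude; taking the higher absolute value
would keep -120. Which of the two Cassandra actually does is exactly the
question.)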

To go further, would it maybe be an idea to count everything twice? Once
as a positive value and once as a negative value. When reading the counters,
the application could just compare the negative and positive counters to get
an error margin.

Has anybody tried something like this? I assume most people would rather
have an under- than an overcount.

cheers,
Christian

Re: are counters stable enough for production?

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
"The repair of taking the highest value of two inconsistent might cause
getting higher values?"

I guess so:

From Sylvain Lebresne: "Now what the code does to "repair" in that case is
to pick the higher of the two values it has. But honestly that's random,
there's a 50/50 chance that it will pick the right value."

So this repair has a 50% chance of leading to an over-count.
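
(Concretely: if one replica says the value is 10 and the other says 7, the
repair keeps 10. When 10 was the correct value nothing is lost; when 7 was
correct, the counter is now permanently 3 too high, hence the 50/50.)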

"Maybe even much higher values if it takes place multiple times?"

I don't understand this bug very well, so I can't really answer this.
However, I never had over-counts that were so big that they were
unacceptable, maybe because I only get a few (from 0 to 30) of these errors
per day out of somewhere between hundreds of thousands and a few million
increments per day.
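
(As a rough rate, 30 bad increments out of a few hundred thousand to a few
million per day works out to somewhere around 0.001%-0.01% of increments
affected.)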

Alain

Re: are counters stable enough for production?

Posted by Robin Verlangen <ro...@us2.nl>.
@Alain: " If you don't have much time to read this, just know that it's a
random error, which appear with low frequency, but regularly, seems to
appear quite randomly, and nobody knows the reason why it appears yet.
Also, you need to know that it's repaired by taking the highest of the
two inconsistent values."

I was aware of that. The repair of taking the highest of two inconsistent
values might cause higher values? Maybe even much higher values if it takes
place multiple times?

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


Re: are counters stable enough for production?

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi

@Robin, about the log message:

"Sometimes you can see log messages that indicate that counters are out of
sync in the cluster and they get "repaired". My guess would be that the
repairs actually destroys it, however I have no knowledge of the underlying
techniques. "

Here you have an answer from Sylvain Lebresne who, if I understood it well,
is in charge of cassandra counters and many other things.

http://grokbase.com/t/cassandra/user/125zr6n1q9/invalid-counter-shard-errors/1296as2dpw#1296as2dpw


If you don't have much time to read this, just know that it's a random
error, which appears with low frequency, but regularly; it seems to appear
quite randomly, and nobody knows the reason why it appears yet. Also, you
need to know that it's repaired by taking the highest of the
two inconsistent values.

@Bartek, about the counter failures:

We are in a similar situation: we can't afford wrong values, all the more
so because we track the same information in different ways and we can't
afford showing different values for the same thing to our customers...

We had a lot of trouble at the start using counters (counts for a period
vanishing, increasing by x2 or x3 or randomly...). We finally managed to
get something stable. We still have some over-counts but it's nothing big
(it's like a 0.01% error that we can afford). On top of that, we could
replay some logs to rebuild our counters, but we don't do it yet, maybe
someday...

We had trouble at the start because of 2 things:

- Hardware not powerful enough (we started with t1.micro from Amazon, then
small, medium, and now we use m1.large)
- Wrong configuration of cassandra/phpCassa (above all cassandra...); we
learnt a lot before finally getting a stable cluster.

Get JNA installed, get enough heap memory, increase timeouts in your
cassandra client (overcount is often due to timeouts, themselves often
produced by a highly loaded CPU), get your CPU load down (by getting more
memory or configuring things well, possibly tuning
compaction_throughput_mb_per_sec and disabling multithreaded_compaction)...
I can't tell you much more about config because there are a lot of different
things you can do to improve Cassandra performance and I'm not an expert :/.
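
About the timeouts: counter increments are not idempotent, so if the client
times out and retries, an increment that actually reached the cluster can be
applied twice. A tiny illustration (plain Python, nothing Cassandra-specific;
send_increment() is just a hypothetical stand-in for the real client call):

    class RequestTimeout(Exception):
        pass

    server_total = 0

    def send_increment(delta, lose_ack=False):
        # The server applies the increment either way; only the ack is lost.
        global server_total
        server_total += delta
        if lose_ack:
            raise RequestTimeout("client gave up waiting for the ack")

    def increment_with_retry(delta):
        try:
            send_increment(delta, lose_ack=True)   # first attempt times out
        except RequestTimeout:
            send_increment(delta)                  # retry: counted a second time

    increment_with_retry(1)
    print(server_total)   # 2, although the application only meant to add 1

That's why raising the client timeout (and keeping the CPU load low enough
that requests rarely time out at all) directly reduces overcounts, whereas
for a normal idempotent write a retry would be harmless.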

Hope it may help somehow and you'll find out what's wrong.

Alain


Re: are counters stable enough for production?

Posted by rohit bhatia <ro...@gmail.com>.
@Robin
I'm pretty sure the GC issue is due to counters only, since we have
only write-heavy counter-incrementing traffic.
GC frequency also increases linearly with write load.

@Bartlomiej
When stress testing, we see GC frequency increase and consequently write
latency rise to several milliseconds.
At 50k qps we had GC running every 1-2 seconds, and since each ParNew
takes around 100ms, we were spending 10% of each server's time GCing.

Also, we don't have persistent connections, but testing with
persistent connections gives roughly the same result.

At a traffic of roughly 20k qps across 8 nodes with RF 2, we have young
gen GC running on each node every 4 seconds (approximately).
We have a young gen heap size of 3200M, which is already too big by any
standard.
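
(Back-of-the-envelope from the numbers above, assuming most allocation is on
the increment path: 20k qps with RF 2 across 8 nodes is roughly 5k replica
writes per node per second; if the 3200M young gen fills in about 4 seconds,
that is about 800 MB/s of allocation per node, i.e. on the order of 160 KB
per increment, the same order of magnitude as the 50-100 KB figure I
mentioned before.)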

Also decreasing Replication factor from 2 to 1 reduced the GC
frequency 5-6 times.

Any Advice?

Also, our traffic is evenly

Re: are counters stable enough for production?

Posted by Robin Verlangen <ro...@us2.nl>.
We've not been trying to create inconsistencies as you describe above. But
it seems legit that those situations cause problems.

Sometimes you can see log messages that indicate that counters are out of
sync in the cluster and they get "repaired". My guess would be that the
repairs actually destroy it, however I have no knowledge of the underlying
techniques. I think this because those read repairs happen a lot (as you
mention: lots of reads) and might get over-repaired or something? However,
this is all just a guess. I hope someone with a lot of knowledge about
Cassandra internals can shed some light on this.

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


Re: are counters stable enough for production?

Posted by Bartłomiej Romański <br...@sentia.pl>.
Garbage is one more issue we are having with counters. We are
operating under very heavy load. Counters are spread over 7 nodes with
SSD drives and we often see CPU usage between 90-100%. We are doing
mostly reads. Latency is very important for us, so GC pauses taking
longer than 10ms (often around 50-100ms) are very annoying.

I don't have actual numbers right now, but we've also got the
impression that cassandra generates "too much" garbage. Is it
possible that counters are somehow to blame?

@Rohit: Did you try something more stressful? Like sending more
traffic to a node than it can actually handle, turning nodes up and
down, changing the topology (moving/adding nodes)? I believe our
problems come from very high load and some operations like this
(adding new nodes, replacing dead ones etc...). I was expecting that
cassandra would fail some requests, lose consistency temporarily or
something like that in such cases, but generating highly incorrect
values was very disappointing.

Thanks,
Bartek


>

Re: are counters stable enough for production?

Posted by Robin Verlangen <ro...@us2.nl>.
@Rohit: We also use counters quite a lot (let's say 2000 increments /
sec), but don't see the 50-100KB of garbage per increment. Are you sure
that memory is coming from your counters?
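
(If it really were 50-100 KB per increment, our 2000 increments/sec alone
would already generate roughly 100-200 MB/s of garbage, which would be hard
to miss.)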

Best regards,

Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E robin@us2.nl


Re: are counters stable enough for production?

Posted by rohit bhatia <ro...@gmail.com>.
We use counters in an 8-node cluster with RF 2 on cassandra 1.0.5.
We use phpcassa and execute CQL queries through Thrift to work with
composite types.

We do not have any problem with overcounts, as we tally against an RDBMS daily.

It works fine, but we are having some GC pressure in the young generation.
By my calculation, around 50-100 KB of garbage is generated for every
counter increment.
Is this memory usage expected of counters?
