Posted to user@cassandra.apache.org by Arthur Zubarev <ar...@aol.com> on 2014/09/28 16:55:09 UTC

Indexes Fragmentation

 Hi all:

A client running an RDBMS sees its indexes fragment quickly and its statistics become inaccurate, many of them within 4 hours (fast updates + writes, but mostly updates).

I am looking into replacing the RDBMS with Cassandra.

Will I face the same issue with indexes with Cassandra?

Thank you!

Regards,

Arthur

Re: Indexes Fragmentation

Posted by James Briggs <ja...@yahoo.com>.
MySQL Cluster (don't use FKs yet) or Redis (in-memory databases) sound more appropriate
for data that churns a lot.

 
Thanks, James Briggs. 
-- 
Cassandra/MySQL DBA. Available in San Jose area or remote. 
cass_top: https://github.com/jamesbriggs/cassandra-top




Re: Indexes Fragmentation

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, Oct 3, 2014 at 6:03 PM, Arthur Zubarev <ar...@aol.com>
wrote:

> I now see I had misspelled the word toll as tall. Anyway, if I
> understood correctly, your reply implies there is no impact whatsoever and
> there is no need to defrag indexes of the frequently changing columns.
>

"Cases with lots of secondary indexes which have a lot of churn are not
well suited for a database with immutable datafiles which wants to be
accessed by Primary Key."

The fragmentation is really bad, because the data files are immutable and
you have a lot of churn. Probably don't do it?

=Rob
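
A rough way to picture the point above is a toy append-only store. This is only an illustrative sketch, not Cassandra's actual index implementation: the class `ToyIndexedLog` and its methods are hypothetical names, and the model assumes that a stale index entry can only be shadowed by appending a tombstone, never edited in place.

```python
class ToyIndexedLog:
    """Toy append-only store whose secondary index is itself a log.

    Hypothetical model for illustration only; not Cassandra code.
    """

    def __init__(self):
        self.rows = {}        # primary key -> current value of the indexed column
        self.index_log = []   # append-only records: (value, key, is_tombstone)

    def upsert(self, key, value):
        old = self.rows.get(key)
        if old is not None and old != value:
            # Existing records cannot be modified, so the stale index
            # entry is shadowed by appending a tombstone for it.
            self.index_log.append((old, key, True))
        if old != value:
            self.index_log.append((value, key, False))
        self.rows[key] = value

    def live_entries(self):
        """Replay the log; the last record for each (value, key) pair wins."""
        state = {}
        for value, key, dead in self.index_log:
            state[(value, key)] = dead
        return sorted(pair for pair, dead in state.items() if not dead)

    def compact(self):
        """Rewrite the log so that only live entries remain."""
        self.index_log = [(v, k, False) for v, k in self.live_entries()]


store = ToyIndexedLog()
for i in range(100):
    store.upsert("user-1", f"status-{i}")   # 100 updates to one indexed column

print(len(store.index_log))   # 199 records written for a single live entry
store.compact()
print(len(store.index_log))   # 1 record left after compaction
```

In this toy model the garbage is reclaimed only when `compact()` rewrites the log, which is why a high update rate on an indexed column is costly in a store with immutable files.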

Re: Indexes Fragmentation

Posted by Arthur Zubarev <ar...@aol.com>.
Hello Rob,
    
I now see I had misspelled the word toll as tall. Anyway, if I understood correctly, your reply implies there is no impact whatsoever and there is no need to defrag indexes of the frequently changing columns.

Am I right?
 
Thank you!

 

Regards,

Arthur

 






Re: Indexes Fragmentation

Posted by Robert Coli <rc...@eventbrite.com>.
On Sun, Sep 28, 2014 at 9:49 AM, Arthur Zubarev <ar...@aol.com>
wrote:
>
> There are 200+ times more updates and 50x inserts than analytical loads.
> In Cassandra to just be able to query (in CQL) on a column I have to have
> an index, the question is what tall the fragmentation coming from the
> frequent updates and inserts has on a CF? Do I also need to manually
> defrag?
>

You appear to have just asked whether maintaining indexes which have a high
rate of change in a log-structured database with immutable data files is
likely to be more performant than maintaining them in a database with
modify-in-place semantics.

"No."

=Rob

Re: Indexes Fragmentation

Posted by Arthur Zubarev <ar...@aol.com>.
The vendor application is not likely to change at all.

There are 200+ times more updates and 50 times more inserts than analytical loads.

I can simply remove the indexes (in the RDBMS) and thus avoid the issue altogether, but I expect the analytical loads to suffer.

In Cassandra, just to be able to query (in CQL) on a column I have to have an index. The question is what tall the fragmentation coming from the frequent updates and inserts has on a CF. Do I also need to defrag manually, or is it more or less manageable?

 

Regards,

Arthur


 

 








Re: Indexes Fragmentation

Posted by Jack Krupansky <ja...@basetechnology.com>.
It’s always a tradeoff between the level of sophistication of the platform and how much work you want to do in the application itself.

But, yes, secondary indexing is always added overhead, and added complexity.

And index tables are a viable approach as well. Again, trading off a simpler platform for added complexity in the application.

Which way to go? As we say in data modeling, always start by looking at what queries and access patterns you expect to be using.

So, how many different ways do you expect to query?

Your original inquiry related to fragmentation due to heavy updates, but the background question remains: how do you intend to access that updated data? I mean, any perceived fragmentation may just be statistical noise compared to access efficiency overall.

-- Jack Krupansky
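
The "index table" approach mentioned above can be sketched as follows. This is a minimal in-memory illustration, assuming the application itself writes to both a base table and a lookup table on every change; the class `UsersWithEmailLookup` and its fields are hypothetical names, with plain dictionaries standing in for the two tables.

```python
class UsersWithEmailLookup:
    """Application-maintained index: every write touches two 'tables'.

    Hypothetical sketch; dictionaries stand in for the base and index tables.
    """

    def __init__(self):
        self.users_by_id = {}    # base table:  user_id -> email
        self.ids_by_email = {}   # index table: email -> set of user_ids

    def set_email(self, user_id, email):
        old = self.users_by_id.get(user_id)
        if old is not None and old != email:
            # The application, not the platform, must remove the stale entry.
            ids = self.ids_by_email[old]
            ids.discard(user_id)
            if not ids:
                del self.ids_by_email[old]
        self.users_by_id[user_id] = email
        self.ids_by_email.setdefault(email, set()).add(user_id)

    def find_by_email(self, email):
        """Look up by the non-key column using only a key lookup on the index table."""
        return sorted(self.ids_by_email.get(email, set()))


db = UsersWithEmailLookup()
db.set_email(1, "a@example.com")
db.set_email(2, "a@example.com")
db.set_email(1, "b@example.com")          # update: index entry must move
print(db.find_by_email("a@example.com"))  # [2]
print(db.find_by_email("b@example.com"))  # [1]
```

The extra bookkeeping in `set_email` is the "added complexity in the application" that is traded for a simpler platform.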


Re: Indexes Fragmentation

Posted by Arthur Zubarev <ar...@aol.com>.
Thank you Jack,

But I am afraid it may be an overhead. Added complexity.

/Arthur



 





Re: Indexes Fragmentation

Posted by Jack Krupansky <ja...@basetechnology.com>.
Take a look at DataStax Enterprise as well, with its integrated Solr indexing of Cassandra data.

-- Jack Krupansky


Re: Indexes Fragmentation

Posted by Arthur Zubarev <ar...@aol.com>.
More info: writers change in the vicinity of 40-50% of all data in the RDBMS-based db, e.g. 100GB a week.
The indexes can be defragged, which is both expensive and time-consuming. Many indexes quickly become out of date.

Not sure what you mean with respect to consistency against indexes.

I doubt there is room for bugs; this is "by design" application behaviour.

 

Regards,

Arthur

 





Re: Indexes Fragmentation

Posted by Hannu Kröger <hk...@gmail.com>.
Hi,

I think more information is needed before this question can be answered. In
many cases you manage the indexes yourself. If that breaks, then you
have a consistency problem or a bug in your own code. Consistency is
tunable (a trade-off with performance and availability) and bugs can be
fixed. In any case, if you can shed a bit of light on the use case, then it
would be easier to answer your question.

Hannu
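
One common way to reason about "tunable" consistency is the quorum-overlap rule: a read is guaranteed to contact at least one replica that acknowledged the latest write when read replicas plus write replicas exceed the replication factor. The sketch below only encodes that arithmetic; the function names are hypothetical and it does not call any Cassandra API.

```python
def quorum(replication_factor):
    """Smallest majority of replicas."""
    return replication_factor // 2 + 1


def read_overlaps_write(replication_factor, write_acks, read_acks):
    """True when every read set must share a replica with every write set."""
    return write_acks + read_acks > replication_factor


rf = 3
print(quorum(rf))                                        # 2
print(read_overlaps_write(rf, quorum(rf), quorum(rf)))   # True: 2 + 2 > 3
print(read_overlaps_write(rf, 1, 1))                     # False: 1 + 1 <= 3
```

Waiting for fewer acknowledgements is faster and tolerates more unavailable replicas, at the cost of possibly reading stale data; that is the trade-off with performance and availability mentioned above.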
