Posted to user@cassandra.apache.org by Chamila Wijayarathna <cd...@gmail.com> on 2014/12/14 13:01:02 UTC

Cassandra Database using too much space

Hello all,

We are trying to develop a language corpus using Cassandra as its
storage medium.

https://gist.github.com/cdwijayarathna/7550176443ad2229fae0 shows the types
of information we need to extract from the corpus interface.
So we designed the schema at
https://gist.github.com/cdwijayarathna/6491122063152669839f to use as the
database. Our target is to develop a corpus with 100+ million words.

So far we have inserted about 1.5 million words and the database has used
about 14 GB of space. Is this normal, or are we doing something wrong? Is
there any issue in our data model?

Thank You!
-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.

Re: Cassandra Database using too much space

Posted by Chamila Wijayarathna <cd...@gmail.com>.
Hi Ryan,

Thank you very much. This helps a lot.


-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.

Re: Cassandra Database using too much space

Posted by Ryan Svihla <rs...@datastax.com>.
Your data model looks fine at a glance: a lot of tables, but they
appear to map to logically obvious query paths. This denormalization
will make your queries fast but eat up more disk. If disk is really a
pain point, I'd suggest looking at your economics a bit and weighing the
tradeoffs.


   1. If you want less disk usage and can afford longer query times,
   switch from denormalized views to indexes instead. You'll get better
   disk space savings, at the cost of more round trips on a read (read the
   index value, get the partition key, do another read).
   2. If you really need queries to be as fast as possible, then you're on
   the right path, but you'll have to realize this is the cost of scale. Even
   with relational databases I've had to use a similar strategy in the past
   to speed up lookups (fewer distinct query parameters in that case, and
   more queries that would normally require lots of joins).

Hope this helps explain the tradeoffs and costs.
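To make the tradeoff concrete, here is a rough sketch (Python, with plain
dicts standing in for tables; all table and key names are made up for
illustration, not taken from the schema in the gist) of why the index
approach costs an extra round trip per read:

```python
# Denormalized style: one table per query path, answer in a single lookup.
word_freq_by_year = {
    ("2014", "abc"): 42,   # partition key (year, word) -> frequency
}

# Index style: a small index table plus the base table.
base_table = {"abc": {"2014": 42}}   # word -> {year: frequency}
year_index = {"2014": ["abc"]}       # year -> words seen that year

def read_denormalized(year, word):
    """One round trip: the table is laid out for this exact query."""
    return 1, word_freq_by_year[(year, word)]

def read_via_index(year, word):
    """Two round trips: hit the index, then hit the base table."""
    trips = 1                        # read the index partition
    if word not in year_index[year]:
        return trips, None
    trips += 1                       # read the base table partition
    return trips, base_table[word][year]
```

The denormalized path answers in one lookup while the index path needs two,
which is exactly the latency-for-disk trade described above.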



-- 

Ryan Svihla

Solution Architect, DataStax <http://www.datastax.com/>

Twitter <https://twitter.com/foundev> | LinkedIn
<http://www.linkedin.com/pub/ryan-svihla/12/621/727/>

DataStax is the fastest, most scalable distributed database technology,
delivering Apache Cassandra to the world’s most innovative enterprises.
DataStax is built to be agile, always-on, and predictably scalable to any
size. With more than 500 customers in 45 countries, DataStax is the
database technology and transactional backbone of choice for the world’s
most innovative companies such as Netflix, Adobe, Intuit, and eBay.

Re: Cassandra Database using too much space

Posted by Jack Krupansky <ja...@basetechnology.com>.
I also meant to point out that you have to be careful with very wide partitions, like those where the partition key is the year, with all usages for that year. Thousands of rows in a partition are probably okay, but millions could become problematic. 100 MB for a single partition is a reasonable limit; beyond that you need to start using “buckets” to break up ultra-large partitions.
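One common way to apply the bucketing idea (a sketch only; the bucket count
and function names are made up for illustration) is to fold a deterministic
bucket number into the partition key, so one year's rows are spread across
several bounded partitions instead of one unbounded one:

```python
import hashlib

N_BUCKETS = 8  # chosen so each (year, bucket) partition stays well under ~100 MB

def bucket_for(word: str, n_buckets: int = N_BUCKETS) -> int:
    """Deterministically map a word to a bucket so readers know where to look."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return digest[0] % n_buckets

def partition_key(year: int, word: str) -> tuple:
    """The partition key becomes (year, bucket) instead of just (year,)."""
    return (year, bucket_for(word))
```

A point read still touches only one partition (the bucket is derived from
the word), while a scan of a whole year now touches N_BUCKETS partitions of
bounded size rather than a single huge one.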

Also, you need to look carefully at how you want to query each table.

-- Jack Krupansky


Re: Cassandra Database using too much space

Posted by Chamila Wijayarathna <cd...@gmail.com>.
Hi Jack ,

Thanks for replying.

What I meant by 1.5M words is not 1.5M distinct words; it is the count
of all words we added to the corpus (total word instances). In the
word_frequency and word_ordered_frequency CFs, we have a row for each
distinct word with its frequency (the two CFs have the same data with
different indexing). We also keep frequencies by year, by category
(newspaper, magazine, fiction, etc.), and by the position where the word
occurs in a sentence. So the distinct word count is probably about 0.2M. We
don't keep any rows in a frequency table where the frequency is 0, so the
word 'abc' may only have rows for the years 2014 and 2010 if it was only
used in those years.

In the bigram and trigram tables, we do not store all possible combinations
of words; we only store the bigrams/trigrams that occur in the resources we
have considered. In the word_usage table we have an entry for each word
instance, which means 1.5M rows with the context details of where the word
has been used. The same happens for bigrams and trigrams as well.

We used separate column families (word_usage, word_year_usage,
word_Category_usage) with the same details, since we have to search under 4
scenarios, using

   1. year,
   2. category,
   3. year & category,
   4. none

inside the WHERE clause, and also order the results by date. They contain
the same data but with different indexing. The same goes for the bigram and
trigram CFs.
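The one-table-per-filter-combination routing described above might look
like this on the client side (a sketch; the first three table names are the
ones from this message, while the year-and-category table name is a
hypothetical stand-in since it isn't named here):

```python
def table_for(year=None, category=None):
    """Pick the CF whose partition key matches the supplied filters."""
    if year is not None and category is not None:
        return "word_year_category_usage"   # hypothetical name for scenario 3
    if year is not None:
        return "word_year_usage"            # scenario 1
    if category is not None:
        return "word_Category_usage"        # scenario 2
    return "word_usage"                     # scenario 4: no filter
```

Each table holds the same rows, keyed differently, so every WHERE clause
hits exactly one table with no secondary index or ALLOW FILTERING.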

We update frequencies while inserting words into the database, so for every
word instance we add, we either insert a new row or update an existing row.
In some cases where we use frequency as a clustering key, since we can't
update a clustering column, we delete the entire row and add a new row with
the updated frequency. [1] is the client we used for inserting data.
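The delete-then-reinsert pattern described above can be sketched like this
(Python, with a sorted list standing in for the frequency-ordered CF; the
class and method names are illustrative, not from the client in [1]):

```python
import bisect

class FrequencyOrderedView:
    """Keeps (frequency, word) pairs sorted, mimicking a CF clustered by frequency.

    Because frequency is part of the ordering key, changing it means
    deleting the old row and inserting a replacement, as described above.
    """

    def __init__(self):
        self.rows = []     # sorted list of (frequency, word)
        self.current = {}  # word -> current frequency

    def increment(self, word):
        old = self.current.get(word)
        if old is not None:
            # "Delete" the old row: the key (old, word) can't be updated in place.
            self.rows.remove((old, word))
        new = (old or 0) + 1
        bisect.insort(self.rows, (new, word))  # insert the replacement row
        self.current[word] = new
```

Every increment costs a delete plus an insert; in Cassandra the delete also
leaves a tombstone behind, which is one reason this pattern inflates disk
usage until compaction catches up.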

I am very new to Cassandra and may have done a lot of things wrong in
modeling and implementing this database. Please let me know if there is
anything wrong here.

Thank You!

1.
https://github.com/DImuthuUpe/DBFeederMvn/blob/master/src/main/java/com/sinmin/corpus/cassandra/CassandraClient.java



-- 
*Chamila Dilshan Wijayarathna,*
SMIEEE, SMIESL,
Undergraduate,
Department of Computer Science and Engineering,
University of Moratuwa.

Re: Cassandra Database using too much space

Posted by Jack Krupansky <ja...@basetechnology.com>.
It looks like you will have quite a few “combinatoric explosions” to cope with. In addition to 1.5M words, you have bigrams and trigrams: combinations of two and three words. You need to get a handle on the cardinality of each of your tables. Bigrams and trigrams could give you who knows how many millions more rows than the 1.5M word frequency rows.
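A quick way to get that handle before loading anything (a sketch; the token
list is a toy stand-in for the real corpus) is to count distinct n-grams
directly:

```python
from collections import Counter

def ngram_cardinalities(tokens, max_n=3):
    """Count distinct 1-, 2-, and 3-grams: rough row counts for each table."""
    counts = {}
    for n in range(1, max_n + 1):
        grams = Counter(tuple(tokens[i:i + n])
                        for i in range(len(tokens) - n + 1))
        counts[n] = len(grams)
    return counts
```

Running this over a representative sample of the corpus gives a defensible
estimate of how many rows each word, bigram, and trigram table will hold.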

And then you have word, bigram, and trigram frequencies by year as well, meaning take the counts from above and multiply by the number of years in your corpus!

And then you have word, bigram, and trigram “usage”, and by year as well. Is that every unique sentence from the corpus? Either way, this is an incredible combinatoric explosion.

And then there is category and position, which I didn’t look at since you didn’t specify what exactly they are. Once again, start with a focus on cardinality of the data.

In short, just as a thought experiment, say that your 1.5M words expanded into 15M rows; divide that into 15 GB and you get 1000 bytes per row, which may be a bit more than desired but is not totally unreasonable. And maybe the explosion is more like 30 to 1, which would give about 333 bytes per row, which seems quite reasonable.
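The thought experiment is easy to redo once you know your real row counts
(a sketch; the figures are just the round numbers from this thread):

```python
def bytes_per_row(disk_bytes, rows):
    """Average on-disk cost per row: the sanity check described above."""
    return disk_bytes / rows

GB = 10**9
# ~15 GB of data; 1.5M word instances exploded 10:1 into 15M rows
low_explosion = bytes_per_row(15 * GB, 15_000_000)    # 1000 bytes per row
# ...or 30:1 into 45M rows
high_explosion = bytes_per_row(15 * GB, 45_000_000)   # ~333 bytes per row
```

If the real bytes-per-row figure comes out far above these, the overhead is
in the model (tombstones, duplication) rather than the data itself.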

Also, are you doing heavy updates, for each word (and bigram and trigram) as each occurrence is encountered in the corpus or are you counting things in memory and then only writing each row once after the full corpus has been read?

Also, what is the corpus size – total word instances, both for the full corpus and for the subset containing your 1.5 million words?

-- Jack Krupansky
