Posted to user@cassandra.apache.org by onlinespending <on...@gmail.com> on 2013/11/25 22:18:49 UTC

Inefficiency with large set of small documents?

I’m trying to decide which NoSQL database to use, and I’ve certainly decided against mongodb due to its use of mmap. I’m wondering if Cassandra would also suffer from a similar inefficiency with small documents. In mongodb, if you have a large set of small documents (each much less than the 4KB page size) you will require far more RAM to fit your working set into memory, since a large percentage of a 4KB chunk could very easily include infrequently accessed data outside of your working set. Cassandra doesn’t use mmap, but it would still have to intelligently discard the excess data that does not pertain to a small document when reading that document’s allocation unit from the hard disk into RAM. As an example, let’s say your cluster size is also 4KB, and you have 1000 small 256-byte documents scattered on the disk that you want to fetch on a given query (the total number of documents is over 1 billion). I want to make sure it only consumes roughly 256,000 bytes for those 1000 documents and not 4,096,000 bytes. When it first fetches a cluster from disk it may consume 4KB of cache, but it should ideally only end up consuming the relevant bytes in RAM. If Cassandra just indiscriminately uses RAM in 4KB blocks, then that is unacceptable to me, because if my working set at any given time is just 20% of my huge collection of small documents, I don’t want to have to use servers with 5X as much RAM. That’s a huge expense.
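(Just to make the numbers concrete, here’s a rough back-of-the-envelope sketch in Python of what I mean. The 256-byte document size, 4KB block size and 1000-document query are the assumptions from my example above, not measured behavior of any database.)

    # Rough sketch of the example above (assumed sizes, not measured behavior):
    # 1000 scattered 256-byte documents read through a cache that retains whole
    # 4KB blocks, vs. one that retains only the individual documents.
    DOC_SIZE = 256        # bytes per document (assumption from the example)
    BLOCK_SIZE = 4096     # bytes per disk block / allocation unit (assumption)
    DOCS_FETCHED = 1000   # documents touched by the query

    doc_granular_ram = DOCS_FETCHED * DOC_SIZE       # 256,000 bytes
    block_granular_ram = DOCS_FETCHED * BLOCK_SIZE   # 4,096,000 bytes (worst case:
                                                     # each document on its own block)
    print(doc_granular_ram, block_granular_ram,
          block_granular_ram / doc_granular_ram)     # -> 256000 4096000 16.0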

Thanks,
Ben

P.S. Here’s a detailed post I made this morning in the mongodb user group about this topic.

People have often complained that because mongodb memory-maps everything and leaves memory management to the OS's virtual memory system, the swapping algorithm isn't optimized for database usage. I disagree with this. For the most part, a database-specific swapping or paging algorithm can't be much better than the sophisticated algorithms (such as LRU-based ones) that OSes have refined over many years. Why reinvent the wheel? Yes, you could potentially ensure that certain data (such as the indexes) never gets swapped out to disk, because even if it hasn't been accessed recently, reading it back into memory when it is in fact needed would be too costly. But that's not the bigger issue.

It breaks down with small documents << the page size

This is where using virtual memory for everything really becomes an issue. Suppose you've got a bunch of really tiny documents (e.g. ~256 bytes) that are much smaller than the virtual memory page size (e.g. 4KB). Now let's say you've determined your working set (e.g. those documents in your collection that constitute, say, 99% of those accessed in a given hour) to be 20GB, but your entire collection size is actually 100GB (it's just that 20% of your documents are much more likely to be accessed in a given time period; it's not uncommon for a small minority of documents to be accessed a large majority of the time). If your collection is randomly distributed (as would happen if you simply inserted new documents into your collection), then in this example only about 20% of the documents that fit onto a 4KB page will be part of the working set (i.e. the data that you need frequent access to at the moment). The rest will be made up of much less frequently accessed documents that should ideally be sitting on disk. So there's a huge inefficiency here: 80% of the data that is in RAM is not even something I need to frequently access. In this example, I would need 5X the amount of RAM to accommodate my working set.
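(To put the 5X figure in concrete terms, here is the same arithmetic as a small Python sketch; all of the sizes are the assumed figures from the paragraph above.)

    # Working-set arithmetic from the paragraph above (assumed figures).
    working_set_bytes = 20 * 2**30          # 20GB of frequently accessed documents
    collection_bytes  = 100 * 2**30         # 100GB total collection
    hot_fraction = working_set_bytes / collection_bytes   # 0.2

    # With tiny documents scattered randomly, each resident 4KB page is only
    # ~20% working-set data, so keeping the working set resident means keeping
    # roughly working_set / hot_fraction bytes of pages in RAM.
    ram_needed_bytes = working_set_bytes / hot_fraction
    print(ram_needed_bytes / 2**30, "GB")   # -> 100.0 GB, i.e. 5X the 20GB working set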

Now, as a solution to this problem, you could separate your documents into two (or even a few) collections, with the grouping done by access frequency. The problem with this is that your working set can often change as a function of time of day and day of week. If your application is global, your working set will be far different at 12pm local time in NY vs. 12pm local time in Tokyo. But even more likely is that the working set is constantly changing as new data is inserted into the database. Popularity of a document is often viral. As an example, a photo posted on a social network may start off infrequently accessed but then, after hundreds of "likes", quickly become very frequently accessed and part of your working set. You'd need to actively monitor your documents and manually move a document from one collection to the other, which is very inefficient.

Quite frankly, this is not a burden that should be placed on the user anyway. By punting the problem of memory management to the OS, mongodb requires the user to essentially do its job and group data in a way that patches the inefficiencies in its memory management. As far as I'm concerned, not until mongodb steps up and takes control of memory management can it be taken seriously for very large datasets that often require many small documents with ever-changing working sets.





Re: Inefficiency with large set of small documents?

Posted by Robert Wille <rw...@fold3.com>.
I recently created a test database with about 400 million small records. The
disk space consumed was about 30 GB, or 75 bytes per record.
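
(Back of the envelope, treating GB as 10^9 bytes, that figure is just total size over record count:)

    # Per-record on-disk footprint implied by the numbers above (rough check).
    records = 400_000_000
    disk_bytes = 30 * 10**9          # "about 30 GB", taken as decimal gigabytes
    print(disk_bytes / records)      # -> 75.0 bytes per record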
