Posted to user@cassandra.apache.org by Richard West <ri...@clearchaos.com> on 2010/05/27 03:52:44 UTC

Cassandra's 2GB row limit and indexing

Hi all,

I'm currently looking at new database options for a URL shortener in order
to scale well with increased traffic as we add new features. Cassandra seems
to be a good fit for many of our requirements, but I'm struggling a bit to
find ways of designing certain indexes in Cassandra due to its 2GB row
limit.

The easiest example of this is that I'd like to create an index by the
domain that shortened URLs are linking to, mostly for spam control so it's
easy to grab all the links to any given domain. As far as I can tell, the
typical way to do this in Cassandra is something like this:

DOMAIN = { //columnfamily
    thing.com { //row key
        timestamp: "shorturl567", //column name: value
        timestamp: "shorturl144",
        timestamp: "shorturl112",
        ...
    }
    somethingelse.com {
        timestamp: "shorturl817",
        ...
    }
}

The values here are keys for another columnfamily containing various data on
shortened URLs.
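
For clarity, here's a rough plain-Python model of that layout (no
Cassandra client; I'm just treating the columnfamily as a dict, and
assuming Cassandra's comparator keeps the timestamp columns sorted in
the real schema):

import time

# Stand-in for the DOMAIN columnfamily:
# row key (domain) -> { column name (timestamp) -> value (shorturl key) }
DOMAIN = {}

def index_shorturl(domain, shorturl_key):
    row = DOMAIN.setdefault(domain, {})
    row[time.time()] = shorturl_key  # column name = timestamp, value = key

def links_to(domain):
    # Grab every short URL pointing at a given domain, newest first.
    row = DOMAIN.get(domain, {})
    return [row[ts] for ts in sorted(row, reverse=True)]

index_shorturl('thing.com', 'shorturl567')
index_shorturl('thing.com', 'shorturl144')
print(links_to('thing.com'))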

The problem with this approach is that a popular domain (e.g. blogspot.com)
could be used in many millions of shortened URLs, so would have that many
columns and hit the row size limit mentioned at
http://wiki.apache.org/cassandra/CassandraLimitations.

Does anyone know an effective way to design this type of one-to-many index
around this limitation (could be something obvious I'm missing)? If not, are
the changes proposed for
https://issues.apache.org/jira/browse/CASSANDRA-16 likely to make this
type of design workable?

Thanks in advance for any advice,

Richard

Re: Cassandra's 2GB row limit and indexing

Posted by Jonathan Ellis <jb...@gmail.com>.
Yes, #16 (which is almost done for 0.7) will make this possible.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com

Re: Cassandra's 2GB row limit and indexing

Posted by Jonathan Shook <js...@gmail.com>.
The example is a little confusing, but here are a few options:

1) "sharding"
You can square the capacity by having a 2-level map.
 CF1->row->value->CF2->row->value
 This means finding some natural subgrouping or hash that provides a
good distribution.
2)  "hashing"
You can also use some additional key hashing to spread the rows over a
wider space:
 Find a delimiter that works for you and identify the row that owns it
by "domain" + "delimiter" + hash(domain) modulo some divisor, for
example.
3) "overflow"
You can implement some overflow logic to create overflow rows which
act like (2), but is less sparse
 while count(columns) for candidate row > some threshold, try row +
"delimiter" + subrow++
 This is much easier when you are streaming data in, as opposed to
poking the random value here and there
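
Here's a rough Python sketch of 2); N_BUCKETS and the '!' delimiter
are arbitrary, and I'm assuming the hash is taken over the short URL
key (hashing the domain alone would put everything for one domain back
into a single row):

import hashlib

N_BUCKETS = 16  # the divisor; raise it to keep each row well under 2GB
DELIM = '!'

def bucket(shorturl_key):
    # Stable hash; Python's built-in hash() can vary between runs.
    digest = hashlib.md5(shorturl_key.encode('utf-8')).digest()
    return int.from_bytes(digest[:4], 'big') % N_BUCKETS

def owning_row(domain, shorturl_key):
    # The row this index entry is written to, e.g. "blogspot.com!7".
    return domain + DELIM + str(bucket(shorturl_key))

def rows_for(domain):
    # A reader enumerates every bucket to collect the full index.
    return [domain + DELIM + str(i) for i in range(N_BUCKETS)]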

Just some ideas. I'd go with 2) and find a way to adjust the divisor
to minimize the row spread. 2) isn't guaranteed to provide uniformity,
but 3) isn't guaranteed to provide very good performance. Perhaps a
combination of the two? The column count is readily accessible, so it
may allow for some informed choices at run time. I'm assuming your
column sizes are fairly predictable.
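
And a rough sketch of 3); count_columns below is a hypothetical
stand-in for whatever column-count call your client exposes, not a
real API:

THRESHOLD = 1000000  # columns allowed per row before spilling over

def overflow_row(domain, count_columns):
    # Walk subrows until one has room; easy when inserts stream in order.
    subrow = 0
    while count_columns(domain + '!' + str(subrow)) >= THRESHOLD:
        subrow += 1
    return domain + '!' + str(subrow)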

Has anybody else tackled this before?

