Posted to user@cassandra.apache.org by Wicked J <wi...@gmail.com> on 2010/10/15 20:14:59 UTC

Recommended sort mechanism and partitioner

Hi,
I'm using the TimeUUID/sort-by-column-name mechanism. The column values can
contain text data (in the future they may contain image data as well), so a
row could outgrow the RAM capacity. Given this background, my questions are:

a] How many columns are recommended per row? Based on my app's needs, I
imagine 10 million would be a good starting point for the max_limit (based
on text data). Also note that my app will search in ranges of 100 or 200
columns when there is a large number of records (columnar data), with no
caching solution in front.
b] Which partitioner is recommended, so that the load across the cluster
nodes is not largely uneven?
c] Would you recommend changing the TimeUUID/columnar sort mechanism (with a
change in the data model) to sorting by row key? If so, which partitioner is
recommended, again with the load not being largely uneven?

Thanks

Re: Recommended sort mechanism and partitioner

Posted by Tyler Hobbs <ty...@riptano.com>.
i) Yes

ii) Well, so you don't actually want to use version 1 UUIDs for keys here.
Although they mostly increase in byte order over time, it's only for the
first 8 bytes.  Instead, you can use something like:

'timestamp-foo'

Where 'foo' might be a randomly generated string or something unique per
client.

You could also use 'YYYYMMDDSSmmm' instead of the timestamp if that makes
queries easier for you.
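
A minimal sketch of both points, assuming Python and its standard uuid
module. The names v1_uuid, make_row_key and "client1" are hypothetical and
only for illustration; they are not part of any Cassandra client. The first
half shows why raw version 1 UUID bytes are not a reliable time ordering:
the bytes begin with the low 32 bits of the timestamp, which wrap roughly
every 7 minutes. The second half builds a 'timestamp-foo' style key that
sorts as a plain string.

```python
import time
import uuid


def v1_uuid(ts_100ns):
    """Build a version 1 UUID for a 60-bit timestamp (fixed clock seq and node)."""
    time_low = ts_100ns & 0xFFFFFFFF
    time_mid = (ts_100ns >> 32) & 0xFFFF
    time_hi_version = ((ts_100ns >> 48) & 0x0FFF) | 0x1000  # version 1
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             0x80, 0x00, 0x123456789ABC))


# Two timestamps one tick apart, straddling a wrap of the low 32 bits.
base = (0x1E0 << 48) | (1 << 32) | 0xFFFFFFFF
earlier = v1_uuid(base)
later = v1_uuid(base + 1)

print(earlier.time < later.time)    # True: the embedded timestamps are ordered
print(earlier.bytes < later.bytes)  # False: the raw bytes are not


def make_row_key(client_id, now_ms=None):
    """Build a 'timestamp-foo' key: zero-padded milliseconds, then a unique suffix."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    suffix = uuid.uuid4().hex[:8]  # random part so concurrent writers do not collide
    # Fixed width (13 digits) makes string order match numeric order.
    return "%013d-%s-%s" % (now_ms, client_id, suffix)


keys = [make_row_key("client1", now_ms=t)
        for t in (1287165326523, 1287165326492, 999)]
print(sorted(keys) == [keys[2], keys[1], keys[0]])  # True: oldest key sorts first
```

The zero padding is the important design choice: without it, '999' would
sort after '1287165326492' in a byte-wise comparison.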

- Tyler

On Fri, Oct 15, 2010 at 6:22 PM, Wicked J <wi...@gmail.com> wrote:

> Tyler,
> Thanks for answering my question. Can you please clarify point (c)?
>
> i] Are you saying that if I move to a second row (identified by a rowKey in
> Cassandra) after I hit 10 million column values in the first row, only then
> will the second row be written to a new node in the cluster? That is, all 10
> million column values within the first row (rowKey) will have been written
> to one and the same node, regardless of the number of nodes in the cluster?
>
> ii] Assume I change my data model to the one below (CF1) with
> "OrderPreservingPartitioner". Would I then be able to read data in the
> order inserted? My understanding is that TimeUUID values cannot be inserted
> as row keys through the Thrift API in v0.6.4, i.e. via the insert method in
> Cassandra.Client. Or am I missing something?
>
> CF1:
>
> Key: '1'
>   name: colname, value: 'First Inserted', timestamp: 1287165326492
> Key: '2'
>   name: colname, value: 'Second Inserted', timestamp: 1287165326523
>
> Thanks!

Re: Recommended sort mechanism and partitioner

Posted by Wicked J <wi...@gmail.com>.
Tyler,
Thanks for answering my question. Can you please clarify point (c)?

i] Are you saying that if I move to a second row (identified by a rowKey in
Cassandra) after I hit 10 million column values in the first row, only then
will the second row be written to a new node in the cluster? That is, all 10
million column values within the first row (rowKey) will have been written
to one and the same node, regardless of the number of nodes in the cluster?

ii] Assume I change my data model to the one below (CF1) with
"OrderPreservingPartitioner". Would I then be able to read data in the
order inserted? My understanding is that TimeUUID values cannot be inserted
as row keys through the Thrift API in v0.6.4, i.e. via the insert method in
Cassandra.Client. Or am I missing something?

CF1:

Key: '1'
  name: colname, value: 'First Inserted', timestamp: 1287165326492
Key: '2'
  name: colname, value: 'Second Inserted', timestamp: 1287165326523

Thanks!

On Fri, Oct 15, 2010 at 12:18 PM, Tyler Hobbs <ty...@riptano.com> wrote:

> a) 10 mil sounds fine.  Just watch out for compaction. Huge rows can kill
> you there, from my understanding.
>
> b) Use RandomPartitioner unless you absolutely have to use something else.
>
> c) If you're inserting all along one row and only moving to another row
> when you hit 10 mil, you're only going to be writing to one node at a time.
> In this sense, you might want to consider using the TimeUUID as a row key
> instead.  There's not really a problem with having tons of rows in a column
> family.
>
> If you want to be able to get a slice of time with this scheme, you can
> either use an order preserving partitioner or have a second column family
> with an index row (or rows) sorted by TimeUUID. (This sounds like what
> you're suggesting.)
>
> - Tyler

Re: Recommended sort mechanism and partitioner

Posted by Tyler Hobbs <ty...@riptano.com>.
a) 10 mil sounds fine.  Just watch out for compaction. Huge rows can kill
you there, from my understanding.

b) Use RandomPartitioner unless you absolutely have to use something else.

c) If you're inserting all along one row and only moving to another row when
you hit 10 mil, you're only going to be writing to one node at a time.  In
this sense, you might want to consider using the TimeUUID as a row key
instead.  There's not really a problem with having tons of rows in a column
family.
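
To illustrate points (b) and (c), here is a rough simulation, assuming
Python. It is a simplified model rather than Cassandra's actual code:
RandomPartitioner places a row by the MD5 hash of its key, and the sketch
assumes 4 nodes that each own an equal slice of the token ring, ignoring
replication. NODES, RING and node_for are hypothetical names.

```python
import hashlib
from collections import Counter

NODES = 4        # assumed cluster size
RING = 2 ** 127  # token space, assumed to be split evenly between the nodes


def node_for(row_key):
    """Map a row key to a node index via its MD5 hash (simplified ring)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    token = int(digest, 16) % RING
    return token * NODES // RING


# One wide row: every column write carries the same row key.
wide = Counter(node_for("row-1") for _ in range(10000))

# One row per item: every write carries its own row key.
spread = Counter(node_for("item-%d" % i) for i in range(10000))

print(len(wide))       # 1: all writes land on a single node
print(sorted(spread))  # [0, 1, 2, 3]: writes reach every node
```

The placement depends only on the row key, so column names and their sort
order play no part in which node receives a write.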

If you want to be able to get a slice of time with this scheme, you can
either use an order preserving partitioner or have a second column family
with an index row (or rows) sorted by TimeUUID. (This sounds like what you're
suggesting.)
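
The index-row idea can be sketched with an in-memory model, assuming Python.
Nothing here talks to Cassandra: data_cf stands in for the main column
family, TimeIndex for a single index row whose column names are TimeUUIDs
and whose values are row keys. All names (TimeIndex, time_uuid, write,
data_cf) are hypothetical.

```python
import bisect
import uuid


def time_uuid(ts):
    """Build a version 1 UUID carrying timestamp ts (fixed clock seq and node)."""
    return uuid.UUID(fields=(ts & 0xFFFFFFFF, (ts >> 32) & 0xFFFF,
                             ((ts >> 48) & 0x0FFF) | 0x1000, 0x80, 0x00, 1))


class TimeIndex:
    """Stand-in for one index row: entries kept sorted by TimeUUID timestamp."""

    def __init__(self):
        self._entries = []  # sorted list of (timestamp, uuid bytes, row key)

    def add(self, tuuid, row_key):
        bisect.insort(self._entries, (tuuid.time, tuuid.bytes, row_key))

    def slice(self, start, end, count=100):
        """Return up to count row keys with start <= timestamp <= end."""
        lo = bisect.bisect_left(self._entries, (start,))
        hi = bisect.bisect_right(self._entries, (end, b"\xff" * 17))
        return [key for _, _, key in self._entries[lo:min(hi, lo + count)]]


data_cf = {}         # main column family: row key -> columns
index = TimeIndex()  # second column family holding the index row


def write(row_key, value, ts):
    """Store the data row, then record its key in the time index."""
    data_cf[row_key] = {"colname": value}
    index.add(time_uuid(ts), row_key)


write("k-b7", "First Inserted", 100)
write("k-02", "Second Inserted", 200)
write("k-9a", "Third Inserted", 300)

print(index.slice(150, 300))  # ['k-02', 'k-9a']
```

A single index row is itself one row on one node, which is presumably why
the suggestion says "row (or rows)"; splitting the index into several rows,
for example one per day, is one possible way to bound its size.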

- Tyler


Re: Recommended sort mechanism and partitioner

Posted by Paul Prescod <pa...@prescod.net>.
I wrote some thoughts about this on my blog. I think it's still mostly correct:

 * http://www.ayogo.com/techblog/2010/04/sorting-in-cassandra/
