Posted to user@cassandra.apache.org by Aditya Narayan <ad...@gmail.com> on 2011/10/29 06:42:40 UTC

Programmatically allow only one out of two types of rows in a CF to enter the CACHE

I need to keep the data for some entities in a single CF, split into two
rows per entity. One row contains overview information for the
entity and the other contains detailed information about it. I
want to keep both rows in a single CF so that they can be retrieved in a single
query when both are needed.

The problem I am facing is that I want to cache only the first type of
row (i.e., the overview rows) and keep the second type (which
contains large data) out of the cache.

Is there a way to filter which rows of a single CF are allowed to enter the
cache?

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Mohit Anchlia <mo...@gmail.com>.

Yes, you need 2 queries, but you get a better schema design. I think you might
be trying to optimize something that will not actually give you any gain
beyond a few fewer lines of code. Can you give an example of what you are
trying to do?

Another question: why not store it all within the same row?

On Sat, Oct 29, 2011 at 10:22 AM, Aditya Narayan <ad...@gmail.com> wrote:
> ..so that I can retrieve them through a single query.
>
> For reading cols from two CFs you need two queries, right ?
>
>
>
>
> On Sat, Oct 29, 2011 at 9:53 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
>>
>> Why not use 2 CFs?
>>
>> On Fri, Oct 28, 2011 at 9:42 PM, Aditya Narayan <ad...@gmail.com> wrote:
>> > I need to keep the data of some entities in a single CF but split in two
>> > rows for each entity. One row contains an overview information for the
>> > entity & another row contains detailed information about entity. I am
>> > wanting to keep both rows in single CF so they may be retrieved in a
>> > single
>> > query when required together.
>> >
>> > Now the problem I am facing is that I want to cache only first type of
>> > rows(ie, the overview containing rows) & avoid second type rows(that
>> > contains large data) from getting into cache.
>> >
>> > Is there a way I can manipulate such filtering of cache entering rows
>> > from a
>> > single CF?
>> >
>> >
>> >
>
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Aditya Narayan <ad...@gmail.com>.

...so that I can retrieve them with a single query.

Reading columns from two CFs takes two queries, right?




On Sat, Oct 29, 2011 at 9:53 PM, Mohit Anchlia <mo...@gmail.com> wrote:

> Why not use 2 CFs?
>
> On Fri, Oct 28, 2011 at 9:42 PM, Aditya Narayan <ad...@gmail.com> wrote:
> > I need to keep the data of some entities in a single CF but split in two
> > rows for each entity. One row contains an overview information for the
> > entity & another row contains detailed information about entity. I am
> > wanting to keep both rows in single CF so they may be retrieved in a
> single
> > query when required together.
> >
> > Now the problem I am facing is that I want to cache only first type of
> > rows(ie, the overview containing rows) & avoid second type rows(that
> > contains large data) from getting into cache.
> >
> > Is there a way I can manipulate such filtering of cache entering rows
> from a
> > single CF?
> >
> >
> >
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Mohit Anchlia <mo...@gmail.com>.

Why not use 2 CFs?

On Fri, Oct 28, 2011 at 9:42 PM, Aditya Narayan <ad...@gmail.com> wrote:
> I need to keep the data of some entities in a single CF but split in two
> rows for each entity. One row contains an overview information for the
> entity & another row contains detailed information about entity. I am
> wanting to keep both rows in single CF so they may be retrieved in a single
> query when required together.
>
> Now the problem I am facing is that I want to cache only first type of
> rows(ie, the overview containing rows) & avoid second type rows(that
> contains large data) from getting into cache.
>
> Is there a way I can manipulate such filtering of cache entering rows from a
> single CF?
>
>
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by David Jeske <da...@gmail.com>.

If your summary data is frequently accessed, you will probably be best off
storing the two sets of data separately (either in separate column families
or with different key prefixes). This will give you the greatest
cache locality for your summary data, which you say is popular. If your
summary data is very well cached, then it won't matter that it might
require two disk seeks to get summary + details, because your summary data is
usually in cache anyhow.
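
As an illustration of the key-prefix variant, here is a minimal sketch. It assumes a store that keeps row keys in sorted order; whether that holds in Cassandra depends on the partitioner in use. The "s:" and "d:" prefixes and the helper names are hypothetical, chosen only for this example.

```python
from bisect import bisect_left

# Hypothetical prefixes; any scheme that sorts summaries together would do.
SUMMARY_PREFIX = "s:"
DETAIL_PREFIX = "d:"


def summary_key(entity_id):
    """Row key for the small, frequently read summary row."""
    return SUMMARY_PREFIX + entity_id


def detail_key(entity_id):
    """Row key for the large detail row."""
    return DETAIL_PREFIX + entity_id


def keys_with_prefix(sorted_keys, prefix):
    """Return the contiguous run of keys that start with the given prefix."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


entity_ids = ["alice", "bob", "carol"]
all_keys = sorted(
    [summary_key(e) for e in entity_ids] + [detail_key(e) for e in entity_ids]
)

# In a sorted key space the summary rows sit next to each other,
# so the large detail rows are not interleaved between them.
summary_keys = keys_with_prefix(all_keys, SUMMARY_PREFIX)
print(summary_keys)
```

With this layout, reading summary + details for one entity means fetching two keys (one per prefix), which matches the "two disk seeks" trade-off described above.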

If you want a more specific recommendation than that, we'd need to see
answers to the following questions:

(a) how big is the summary data (total, per row)? (average, max)
(b) how big is the detail data (total, per row)? (average, max)
(c) what is the read/write traffic to the summary data? To the detail data?

A side note about caches... IMO, you're better off getting the cache
behavior you want through physical ordering than through more explicit
caching. This is because most modern databases (Cassandra included) go
through the OS buffer cache already, so caching data at the application
level means some of it is held twice. If your
application cache hit rate is very high (90%+) this can work out, but if
it's lower (50%) it can sometimes hurt the cache efficiency
of both the application cache and the OS buffer cache (because the data is
duplicated in both caches).

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Zach Richardson <j....@gmail.com>.

Aditya,

Have you done any benchmarking that shows where, specifically, you are having
read problems?

I would be surprised if, using one of the techniques described, you weren't
able to get the performance you are looking for.

Zach

On Sat, Oct 29, 2011 at 3:35 PM, Aditya Narayan <ad...@gmail.com> wrote:
> Thanks Zach, Nice Idea !
>
> and what about looking at, may be, some custom caching solutions, leaving
> aside cassandra caching   .. ?
>
>
>
> On Sun, Oct 30, 2011 at 2:00 AM, Zach Richardson
> <j....@gmail.com> wrote:
>>
>> Aditya,
>>
>> Depending on how often you have to write to the database, you could
>> perform dual writes to two different column families, one that has
>> summary + details in it, and one that only has the summary.
>>
>> This way you can get everything with one query, or the summary with
>> one query, this should also help optimize your caching.
>>
>> The question here would of course be whether or not you have a read or
>> write heavy workload.  Since you seem to be concerned about the
>> caching, it sounds like you have more of a read heavy workload and
>> wouldn't pay to heavily with the dual writes.
>>
>> Zach
>>
>>
>> On Sat, Oct 29, 2011 at 2:21 PM, Mohit Anchlia <mo...@gmail.com>
>> wrote:
>> > On Sat, Oct 29, 2011 at 11:23 AM, Aditya Narayan <ad...@gmail.com>
>> > wrote:
>> >> @Mohit:
>> >> I have stated the example scenarios in my first post under this
>> >> heading.
>> >> Also I have stated above why I want to split that data in two rows &
>> >> like
>> >> Ikeda below stated, I'm too trying out to prevent the frequently
>> >> accessed
>> >> rows being bloated with large data & want to prevent that data from
>> >> entering
>> >> cache as well.
>> >
>> > I think you are missing the point. You don't get any benefit
>> > (performance, access), you are already breaking it into 2 rows.
>> >
>> > Also, I don't know of any way where you can selectively keep the rows
>> > or keys in the cache. Other than having some background job that keeps
>> > the cache hot with those keys/rows you only have one option of keeping
>> > it in different CF since you are already breaking a row in 2 rows.
>> >
>> >>
>> >>> Okay so as most know this practice is called a wide row - we use them
>> >>> quite a lot. However, as your schema shows it will cache (while being
>> >>> active) all the row in memory.  One way we got around this issue was
>> >>> to
>> >>> basically create some materialized views of any more common data so we
>> >>> can
>> >>> easily get to the minimum amount of information required without
>> >>> blowing too
>> >>> much memory with the larger representations.
>> >>
>> >> Yes exactly this is problem I am facing but I want to keep the both the
>> >> types(common + large/detailed) of data in single CF so that it could
>> >> server
>> >> 'two materialized views'.
>> >>
>> >>>
>> >>> My perspective is that indexing some of the higher levels of data
>> >>> would be
>> >>> the way to go - Solr or elastic search for distributed or if you know
>> >>> you
>> >>> only need it local just use a caching solution like ehcache
>> >>
>> >> What do you mean exactly by  "indexing some of the higher levels of
>> >> data" ?
>> >>
>> >> Thanks you guys!
>> >>
>> >>
>> >>
>> >>>
>> >>> Anthony
>> >>>
>> >>>
>> >>> On 28/10/2011, at 21:42 PM, Aditya Narayan wrote:
>> >>>
>> >>> > I need to keep the data of some entities in a single CF but split in
>> >>> > two
>> >>> > rows for each entity. One row contains an overview information for
>> >>> > the
>> >>> > entity & another row contains detailed information about entity. I
>> >>> > am
>> >>> > wanting to keep both rows in single CF so they may be retrieved in a
>> >>> > single
>> >>> > query when required together.
>> >>> >
>> >>> > Now the problem I am facing is that I want to cache only first type
>> >>> > of
>> >>> > rows(ie, the overview containing rows) & avoid second type rows(that
>> >>> > contains large data) from getting into cache.
>> >>> >
>> >>> > Is there a way I can manipulate such filtering of cache entering
>> >>> > rows
>> >>> > from a single CF?
>> >>> >
>> >>> >
>> >>>
>> >>
>> >>
>> >
>
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Aditya Narayan <ad...@gmail.com>.

Thanks Zach, nice idea!

And what about looking at some custom caching solution instead, leaving
Cassandra's caching aside?



On Sun, Oct 30, 2011 at 2:00 AM, Zach Richardson <
j.zach.richardson@gmail.com> wrote:

> Aditya,
>
> Depending on how often you have to write to the database, you could
> perform dual writes to two different column families, one that has
> summary + details in it, and one that only has the summary.
>
> This way you can get everything with one query, or the summary with
> one query, this should also help optimize your caching.
>
> The question here would of course be whether or not you have a read or
> write heavy workload.  Since you seem to be concerned about the
> caching, it sounds like you have more of a read heavy workload and
> wouldn't pay to heavily with the dual writes.
>
> Zach
>
>
> On Sat, Oct 29, 2011 at 2:21 PM, Mohit Anchlia <mo...@gmail.com>
> wrote:
> > On Sat, Oct 29, 2011 at 11:23 AM, Aditya Narayan <ad...@gmail.com>
> wrote:
> >> @Mohit:
> >> I have stated the example scenarios in my first post under this heading.
> >> Also I have stated above why I want to split that data in two rows &
> like
> >> Ikeda below stated, I'm too trying out to prevent the frequently
> accessed
> >> rows being bloated with large data & want to prevent that data from
> entering
> >> cache as well.
> >
> > I think you are missing the point. You don't get any benefit
> > (performance, access), you are already breaking it into 2 rows.
> >
> > Also, I don't know of any way where you can selectively keep the rows
> > or keys in the cache. Other than having some background job that keeps
> > the cache hot with those keys/rows you only have one option of keeping
> > it in different CF since you are already breaking a row in 2 rows.
> >
> >>
> >>> Okay so as most know this practice is called a wide row - we use them
> >>> quite a lot. However, as your schema shows it will cache (while being
> >>> active) all the row in memory.  One way we got around this issue was to
> >>> basically create some materialized views of any more common data so we
> can
> >>> easily get to the minimum amount of information required without
> blowing too
> >>> much memory with the larger representations.
> >>
> >> Yes exactly this is problem I am facing but I want to keep the both the
> >> types(common + large/detailed) of data in single CF so that it could
> server
> >> 'two materialized views'.
> >>
> >>>
> >>> My perspective is that indexing some of the higher levels of data would
> be
> >>> the way to go - Solr or elastic search for distributed or if you know
> you
> >>> only need it local just use a caching solution like ehcache
> >>
> >> What do you mean exactly by  "indexing some of the higher levels of
> data" ?
> >>
> >> Thanks you guys!
> >>
> >>
> >>
> >>>
> >>> Anthony
> >>>
> >>>
> >>> On 28/10/2011, at 21:42 PM, Aditya Narayan wrote:
> >>>
> >>> > I need to keep the data of some entities in a single CF but split in
> two
> >>> > rows for each entity. One row contains an overview information for
> the
> >>> > entity & another row contains detailed information about entity. I am
> >>> > wanting to keep both rows in single CF so they may be retrieved in a
> single
> >>> > query when required together.
> >>> >
> >>> > Now the problem I am facing is that I want to cache only first type
> of
> >>> > rows(ie, the overview containing rows) & avoid second type rows(that
> >>> > contains large data) from getting into cache.
> >>> >
> >>> > Is there a way I can manipulate such filtering of cache entering rows
> >>> > from a single CF?
> >>> >
> >>> >
> >>>
> >>
> >>
> >
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Zach Richardson <j....@gmail.com>.

Aditya,

Depending on how often you have to write to the database, you could
perform dual writes to two different column families: one that has
summary + details in it, and one that has only the summary.

This way you can get everything with one query, or just the summary with
one query. This should also help optimize your caching.

The question here would of course be whether you have a read-heavy or
write-heavy workload. Since you seem to be concerned about
caching, it sounds like you have more of a read-heavy workload and
wouldn't pay too heavily for the dual writes.

Zach
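
The dual-write idea above can be sketched roughly as follows. This is only an illustration: the two column families are modelled as plain Python dicts rather than accessed through a real Cassandra client, and the names summary_cf, full_cf and write_entity are hypothetical.

```python
# Two column families modelled as plain dicts: row key -> columns.
summary_cf = {}  # summary only; small rows, a good caching candidate
full_cf = {}     # summary + details; large rows


def write_entity(entity_id, summary, details):
    """Dual write: the summary goes to both CFs, the details only to full_cf."""
    summary_cf[entity_id] = dict(summary)
    full_cf[entity_id] = {**summary, **details}


def read_summary(entity_id):
    """One query against the small CF."""
    return summary_cf[entity_id]


def read_everything(entity_id):
    """One query against the large CF returns summary and details together."""
    return full_cf[entity_id]


write_entity("e1", {"name": "Widget"}, {"description": "long text " * 3})
print(read_summary("e1"))
print(sorted(read_everything("e1")))
```

Note that in a real deployment the two writes are separate operations, so the application has to tolerate (or repair) the case where one succeeds and the other fails.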


On Sat, Oct 29, 2011 at 2:21 PM, Mohit Anchlia <mo...@gmail.com> wrote:
> On Sat, Oct 29, 2011 at 11:23 AM, Aditya Narayan <ad...@gmail.com> wrote:
>> @Mohit:
>> I have stated the example scenarios in my first post under this heading.
>> Also I have stated above why I want to split that data in two rows & like
>> Ikeda below stated, I'm too trying out to prevent the frequently accessed
>> rows being bloated with large data & want to prevent that data from entering
>> cache as well.
>
> I think you are missing the point. You don't get any benefit
> (performance, access), you are already breaking it into 2 rows.
>
> Also, I don't know of any way where you can selectively keep the rows
> or keys in the cache. Other than having some background job that keeps
> the cache hot with those keys/rows you only have one option of keeping
> it in different CF since you are already breaking a row in 2 rows.
>
>>
>>> Okay so as most know this practice is called a wide row - we use them
>>> quite a lot. However, as your schema shows it will cache (while being
>>> active) all the row in memory.  One way we got around this issue was to
>>> basically create some materialized views of any more common data so we can
>>> easily get to the minimum amount of information required without blowing too
>>> much memory with the larger representations.
>>
>> Yes exactly this is problem I am facing but I want to keep the both the
>> types(common + large/detailed) of data in single CF so that it could server
>> 'two materialized views'.
>>
>>>
>>> My perspective is that indexing some of the higher levels of data would be
>>> the way to go - Solr or elastic search for distributed or if you know you
>>> only need it local just use a caching solution like ehcache
>>
>> What do you mean exactly by  "indexing some of the higher levels of data" ?
>>
>> Thanks you guys!
>>
>>
>>
>>>
>>> Anthony
>>>
>>>
>>> On 28/10/2011, at 21:42 PM, Aditya Narayan wrote:
>>>
>>> > I need to keep the data of some entities in a single CF but split in two
>>> > rows for each entity. One row contains an overview information for the
>>> > entity & another row contains detailed information about entity. I am
>>> > wanting to keep both rows in single CF so they may be retrieved in a single
>>> > query when required together.
>>> >
>>> > Now the problem I am facing is that I want to cache only first type of
>>> > rows(ie, the overview containing rows) & avoid second type rows(that
>>> > contains large data) from getting into cache.
>>> >
>>> > Is there a way I can manipulate such filtering of cache entering rows
>>> > from a single CF?
>>> >
>>> >
>>>
>>
>>
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Mohit Anchlia <mo...@gmail.com>.

On Sat, Oct 29, 2011 at 11:23 AM, Aditya Narayan <ad...@gmail.com> wrote:
> @Mohit:
> I have stated the example scenarios in my first post under this heading.
> Also I have stated above why I want to split that data in two rows & like
> Ikeda below stated, I'm too trying out to prevent the frequently accessed
> rows being bloated with large data & want to prevent that data from entering
> cache as well.

I think you are missing the point. You don't get any benefit
(in performance or access) from a single CF, since you are already splitting
the data into 2 rows.

Also, I don't know of any way to selectively keep rows
or keys in the cache. Other than having some background job that keeps
the cache hot with those keys/rows, your only option is to keep
them in different CFs, since you are already splitting one row into 2 rows.
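
To make the "different CFs" option concrete, here is a toy model, assuming that row caching is switched on or off per column family. The ColumnFamily class and its fields are hypothetical and only mimic the behaviour; this is not Cassandra's API.

```python
class ColumnFamily:
    """Toy column family with an optional row cache in front of 'disk'."""

    def __init__(self, name, row_cache_enabled):
        self.name = name
        self.row_cache_enabled = row_cache_enabled
        self.disk = {}        # stands in for on-disk storage
        self.row_cache = {}   # populated only when caching is enabled
        self.disk_reads = 0

    def put(self, key, columns):
        self.disk[key] = columns
        self.row_cache.pop(key, None)  # drop any stale cached copy

    def get(self, key):
        if key in self.row_cache:
            return self.row_cache[key]
        self.disk_reads += 1
        row = self.disk[key]
        if self.row_cache_enabled:
            self.row_cache[key] = row
        return row


# Overview rows are cached; the large detail rows never enter a cache.
overview = ColumnFamily("Overview", row_cache_enabled=True)
details = ColumnFamily("Details", row_cache_enabled=False)

overview.put("e1", {"title": "Widget"})
details.put("e1", {"body": "x" * 1000})

for _ in range(3):
    overview.get("e1")
    details.get("e1")

print(overview.disk_reads, details.disk_reads)
```

The cost of this split is the one discussed in the thread: fetching overview and details together takes two reads instead of one.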

>
>> Okay so as most know this practice is called a wide row - we use them
>> quite a lot. However, as your schema shows it will cache (while being
>> active) all the row in memory.  One way we got around this issue was to
>> basically create some materialized views of any more common data so we can
>> easily get to the minimum amount of information required without blowing too
>> much memory with the larger representations.
>
> Yes exactly this is problem I am facing but I want to keep the both the
> types(common + large/detailed) of data in single CF so that it could server
> 'two materialized views'.
>
>>
>> My perspective is that indexing some of the higher levels of data would be
>> the way to go - Solr or elastic search for distributed or if you know you
>> only need it local just use a caching solution like ehcache
>
> What do you mean exactly by  "indexing some of the higher levels of data" ?
>
> Thanks you guys!
>
>
>
>>
>> Anthony
>>
>>
>> On 28/10/2011, at 21:42 PM, Aditya Narayan wrote:
>>
>> > I need to keep the data of some entities in a single CF but split in two
>> > rows for each entity. One row contains an overview information for the
>> > entity & another row contains detailed information about entity. I am
>> > wanting to keep both rows in single CF so they may be retrieved in a single
>> > query when required together.
>> >
>> > Now the problem I am facing is that I want to cache only first type of
>> > rows(ie, the overview containing rows) & avoid second type rows(that
>> > contains large data) from getting into cache.
>> >
>> > Is there a way I can manipulate such filtering of cache entering rows
>> > from a single CF?
>> >
>> >
>>
>
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Anthony Ikeda <an...@gmail.com>.

By higher-level data I meant the common data.

For example, we plan on creating an index using Solr for search, but as it's Lucene-based you can store the common data as part of a Document. It won't be indexed, but it is still accessible because the document shares the same "id" as the row key.

Sent from my iPhone

On 29/10/2011, at 11:23, Aditya Narayan <ad...@gmail.com> wrote:

> @Mohit:
> I have stated the example scenarios in my first post under this heading.
> Also I have stated above why I want to split that data in two rows & like Ikeda below stated, I'm too trying out to prevent the frequently accessed rows being bloated with large data & want to prevent that data from entering cache as well.
> 
> Okay so as most know this practice is called a wide row - we use them quite a lot. However, as your schema shows it will cache (while being active) all the row in memory.  One way we got around this issue was to basically create some materialized views of any more common data so we can easily get to the minimum amount of information required without blowing too much memory with the larger representations.
> Yes exactly this is problem I am facing but I want to keep the both the types(common + large/detailed) of data in single CF so that it could server 'two materialized views'.
>  
> 
> My perspective is that indexing some of the higher levels of data would be the way to go - Solr or elastic search for distributed or if you know you only need it local just use a caching solution like ehcache
> What do you mean exactly by  "indexing some of the higher levels of data" ?
> 
> Thanks you guys!
> 
> 
>  
> Anthony
> 
> 
> On 28/10/2011, at 21:42 PM, Aditya Narayan wrote:
> 
> > I need to keep the data of some entities in a single CF but split in two rows for each entity. One row contains an overview information for the entity & another row contains detailed information about entity. I am wanting to keep both rows in single CF so they may be retrieved in a single query when required together.
> >
> > Now the problem I am facing is that I want to cache only first type of rows(ie, the overview containing rows) & avoid second type rows(that contains large data) from getting into cache.
> >
> > Is there a way I can manipulate such filtering of cache entering rows from a single CF?
> >
> >
> 
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Aditya Narayan <ad...@gmail.com>.

@Mohit:
I have given the example scenario in my first post under this heading.
I have also stated above why I want to split the data into two rows. As
Ikeda says below, I too am trying to prevent the frequently accessed
rows from being bloated with large data, and I want to keep that data out of
the cache as well.

> Okay so as most know this practice is called a wide row - we use them quite
> a lot. However, as your schema shows it will cache (while being active) all
> the row in memory.  One way we got around this issue was to basically create
> some materialized views of any more common data so we can easily get to the
> minimum amount of information required without blowing too much memory with
> the larger representations.
>
Yes, this is exactly the problem I am facing, but I want to keep both
types of data (common + large/detailed) in a single CF so that it can serve
as 'two materialized views'.


>
> My perspective is that indexing some of the higher levels of data would be
> the way to go - Solr or elastic search for distributed or if you know you
> only need it local just use a caching solution like ehcache

What exactly do you mean by "indexing some of the higher levels of data"?

Thank you, guys!




> Anthony
>
>
> On 28/10/2011, at 21:42 PM, Aditya Narayan wrote:
>
> > I need to keep the data of some entities in a single CF but split in two
> rows for each entity. One row contains an overview information for the
> entity & another row contains detailed information about entity. I am
> wanting to keep both rows in single CF so they may be retrieved in a single
> query when required together.
> >
> > Now the problem I am facing is that I want to cache only first type of
> rows(ie, the overview containing rows) & avoid second type rows(that
> contains large data) from getting into cache.
> >
> > Is there a way I can manipulate such filtering of cache entering rows
> from a single CF?
> >
> >
>
>

Re: Programmatically allow only one out of two types of rows in a CF to enter the CACHE

Posted by Ikeda Anthony <an...@gmail.com>.

Okay, so as most know, this practice is called a wide row; we use them quite a lot. However, as your schema shows, it will cache the whole row in memory (while it is active). One way we got around this issue was to create materialized views of the more commonly used data, so we can easily get to the minimum amount of information required without using too much memory on the larger representations.

My view is that indexing some of the higher levels of data would be the way to go: Solr or Elasticsearch for a distributed setup or, if you know you only need it locally, just a caching solution like Ehcache.

Anthony
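
For the local-caching route, a minimal sketch of an application-side cache that admits only the common (summary) rows might look like the following. It is a generic LRU stand-in, not Ehcache's API; the SelectiveLRUCache class, the "summary:" / "detail:" key prefixes and the backing_store dict are all assumptions made for the example.

```python
from collections import OrderedDict


class SelectiveLRUCache:
    """LRU cache that stores a value only if should_cache(key) is true."""

    def __init__(self, capacity, should_cache):
        self.capacity = capacity
        self.should_cache = should_cache
        self._entries = OrderedDict()

    def get(self, key, load):
        """Return the cached value, or load it and cache it if allowed."""
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        value = load(key)
        if self.should_cache(key):
            self._entries[key] = value
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)  # evict least recently used
        return value

    def __contains__(self, key):
        return key in self._entries


# Stand-in for the database: one small row and one large row per entity.
backing_store = {
    "summary:e1": {"title": "Widget"},
    "detail:e1": {"body": "x" * 1000},
}

cache = SelectiveLRUCache(
    capacity=2,
    should_cache=lambda key: key.startswith("summary:"),
)

cache.get("summary:e1", backing_store.__getitem__)
cache.get("detail:e1", backing_store.__getitem__)

print("summary:e1" in cache, "detail:e1" in cache)
```

Keep in mind the point made elsewhere in the thread: data held in an application cache is usually also held in the OS buffer cache, so this pays off mainly when the hit rate is high.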

On 28/10/2011, at 21:42 PM, Aditya Narayan wrote:

> I need to keep the data of some entities in a single CF but split in two rows for each entity. One row contains an overview information for the entity & another row contains detailed information about entity. I am wanting to keep both rows in single CF so they may be retrieved in a single query when required together. 
> 
> Now the problem I am facing is that I want to cache only first type of rows(ie, the overview containing rows) & avoid second type rows(that contains large data) from getting into cache.
> 
> Is there a way I can manipulate such filtering of cache entering rows from a single CF?
> 
>