Posted to user@cassandra.apache.org by Vodnok <vo...@gmail.com> on 2011/03/01 23:39:04 UTC

Advice on a design

Hi,

I'm a total newbie on Cassandra (with phpcassa), with a long background in
relational databases, and I would like to use Cassandra for a trivial case.
I've only been on it for 3 days, so sorry if my question is stupid. I'm
pretty sure I've got things wrong, but I want to learn, so here I am.


I would like your advice on a design for Cassandra.


Case:

- Users create docs and can share docs with friends
- Users can read their friends' docs and share them with other friends
- Docs can be of different types [text; picture; video; etc.]
- Docs can be tagged



Typical queries:


- Docs related to a tag
- Docs related to multiple tags
- Docs read by user x
- Docs related to a tag with a read/shared ratio (n_ratio_readed_shared)
greater than x (see design)
- All docs of type='IMG' favorited by my friend
- All docs of type='BOT' and c_bot_code='ABC'
- All docs of type='BOT' favorited by my friend and tagged with 'fire'
and 'belgium'?



Design :


docs // all docs
{
    '123456': // id_docs
    {
        't_info':
        {
            'c_type':'BOT'
            'b_del':'y'
            'b_publish':'y'
        }
        't_info_type':
        {
            'l_title':'Hello World!'
            'c_bot_code':'ABC'
        }
        't_read_user': // read by user x
        {
            // time + id_user
            '123456789_123':'123'
            '123456789_155':'155'
        }
        't_shared_user': // shared by user x
        {
            // time + id_user
            '123456789_123':'123'
            '123456789_155':'155'
        }
        't_tags':
        {
            'fire':'fire'
            'belgium':'belgium'
        }
        't_stats':
        {
            'n_readed':'60'
            'n_shared':'6'
            'n_ratio_readed_shared':'0.1'
        }
    }
}


tags_docs // all tags linked to docs
{
    'fire': // tag
    {
        // creation_time + id_docs
        '456789_123456':
        {
            'id_doc':'123456'
            'time':'456789'
        }
        '456789_223456':
        {
            'id_doc':'223456'
            'time':'456789'
        }
        '456789_323456':
        {
            'id_doc':'323456'
            'time':'456789'
        }
    }
    'belgium':
    {
        ...
    }
}
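
A note on the "time + id" column names used in these layouts: if the column comparator compares names as strings (an assumption here, the thread does not say which comparator is configured), timestamps of different lengths will not sort chronologically unless they are zero-padded. A minimal Python sketch of that idea, where make_column_name, parse_column_name and the padding width of 10 are hypothetical choices for illustration:

```python
from typing import Tuple


def make_column_name(timestamp: int, doc_id: int, width: int = 10) -> str:
    """Build a 'time_id' column name whose string order matches time order."""
    # Zero-pad the timestamp; without padding, '9_1' sorts after '10_1'.
    return f"{timestamp:0{width}d}_{doc_id}"


def parse_column_name(name: str) -> Tuple[int, int]:
    """Split a 'time_id' column name back into (timestamp, doc_id)."""
    time_part, id_part = name.split("_", 1)
    return int(time_part), int(id_part)


# One row of a tags_docs-like column family, modelled as a plain dict.
fire_row = {
    make_column_name(456789, 123456): "123456",
    make_column_name(1456789, 223456): "223456",
    make_column_name(56789, 323456): "323456",
}

# A string-comparing comparator would hand columns back in sorted() order.
ordered = [parse_column_name(name) for name in sorted(fire_row)]
```

With a numeric comparator the padding would not be needed; the sketch only covers the string-ordered case.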


users // all users
{
    '123': // id_user
    {
        't_info':
        {
            'l_name':'Boris'
            'c_lang':'fr'
        }
        't_readed_docs':
        {
            // time + id_doc
            '123456789_123456':'123456'
            '123458789_136456':'136456'
        }
        't_shared_docs':
        {
            // time + id_doc
            '123456789_123456':'123456'
            '123458789_136456':'136456'
        }
    }
}


users_docs // all actions by users on docs
{
    '123_123456': // id_user + id_doc
    {
        'id_doc':'123456'
        'id_user':'123'
        'd_readed':'20110301'
        'd_shared':'20110301'
    }
}


user_friends_act // all activity of a user's friends
{
    '123': // id_user
    {
        't_readed_docs': // all docs read by my friends
        {
            '223456_224_123456': // time + id_friend + id_docs
            {
                'id_friend':'224'
                'id_docs':'123456'
                'time':'223456'
                'c_type':'BOT'
            }
        }
        't_shared_docs': // all docs shared by my friends
        {
            '223456_224_123456': // time + id_friend + id_docs
            {
                'id_friend':'224'
                'id_docs':'123456'
                'time':'223456'
                'c_type':'BOT'
            }
        }
    }
}



I know that certain queries are not possible for now, such as: all docs of
type='BOT' favorited by my friend and tagged with 'fire' and 'belgium'.
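
For illustration only, here is what that last query would look like if it were answered on the client side by intersecting rows read from the column families in this design. This is a sketch in Python with plain dicts standing in for rows; docs_for_all_tags, bot_docs_shared_by_friend and friend_shared are hypothetical names, and reading whole rows like this may not scale to large tags or very active friends.

```python
def docs_for_all_tags(tags_docs, tags):
    """Return the ids of docs that carry every tag in `tags`."""
    id_sets = [
        {sub["id_doc"] for sub in tags_docs.get(tag, {}).values()}
        for tag in tags
    ]
    return set.intersection(*id_sets) if id_sets else set()


def bot_docs_shared_by_friend(docs, tags_docs, friend_shared, tags, c_type="BOT"):
    """Intersect tag matches with a friend's shared docs, then filter on type."""
    candidates = docs_for_all_tags(tags_docs, tags) & set(friend_shared.values())
    return {d for d in candidates if docs[d]["t_info"]["c_type"] == c_type}


# Toy rows shaped like the tags_docs, users and docs column families above.
tags_docs = {
    "fire": {
        "456789_123456": {"id_doc": "123456", "time": "456789"},
        "456789_223456": {"id_doc": "223456", "time": "456789"},
    },
    "belgium": {
        "456790_123456": {"id_doc": "123456", "time": "456790"},
        "456791_323456": {"id_doc": "323456", "time": "456791"},
    },
}
# The friend's 't_shared_docs' columns: 'time_iddoc' -> id_doc.
friend_shared = {"123456789_123456": "123456", "123458789_323456": "323456"}
docs = {
    "123456": {"t_info": {"c_type": "BOT"}},
    "223456": {"t_info": {"c_type": "IMG"}},
    "323456": {"t_info": {"c_type": "BOT"}},
}

result = bot_docs_shared_by_friend(docs, tags_docs, friend_shared, ["fire", "belgium"])
```

Each extra condition costs another row read plus a set intersection in the application, which is why this kind of query is awkward to serve from these column families alone.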



What do you think ?


Thank you,


Vodnok,


(Please remember I've only been on Cassandra for 3 days)

Re: Advice on a design

Posted by Vodnok <vo...@gmail.com>.
OK, it seems that I'll use Solr (with a dedicated Cassandra) for search.

I've read this article on RP vs OPP:
http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/


Here is my case:


docs_shared{ //docs shared by users ordered by time
    'time:id_user:id_doc'
    {
        'time':'123456' //index on it
        'id_user':'123' //index on it
        'c_type':'BOT' //index on it
        'id_doc':'123' //index on it
    }
}

So I can list all docs shared by id_user = 123 and type = 'BOT', ordered by
time...

Well, that is what I wanted, until I discovered the RP vs OPP issue. I'm on
the default, so RP, and so row ids are not ordered! And since it's
recommended, I would like to stay on RP.
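
To make the ordering issue concrete, here is a small Python sketch. It assumes, as a simplification, that RandomPartitioner places a row by the MD5 hash of its key (the real token computation differs in detail); rp_token and the sample keys are hypothetical.

```python
import hashlib


def rp_token(row_key: str) -> int:
    """Rough stand-in for a RandomPartitioner token: MD5 of the key as an integer."""
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


# Row keys shaped like 'time:id_user:id_doc', already in time order.
row_keys = ["123456:123:1", "123457:123:2", "123458:155:3", "123459:123:4"]

# Under an order-preserving partitioner, a range scan follows key order.
opp_order = sorted(row_keys)

# Under RP, a range scan follows token order, which ignores the time prefix.
rp_order = sorted(row_keys, key=rp_token)

for key in rp_order:
    print(f"{rp_token(key):039d}  {key}")
```

Column names inside a single row, on the other hand, are kept sorted by the column comparator whichever partitioner is used, which is what the next idea relies on.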

So another possibility is adding a dimension with super columns, as columns
are ordered under RP:

index{
    docs_shared{ // docs shared by users, ordered by time
        'time:id_user:id_doc'
        {
            'time':'123456' // index on it
            'id_user':'123' // index on it
            'c_type':'BOT' // index on it
            'id_doc':'123'
        }
    }
}

BUT... a secondary index is not possible on SC -> C.


So the next possibility is:

index{
    docs_shared_time_c_type_id_user{ // docs shared by users, ordered by time:c_type:id_user
        'time:c_type:id_user:id_doc' : 'id_doc'
    }
    docs_shared_c_type_time_id_user{ // docs shared by users, ordered by c_type:time:id_user
        'c_type:time:id_user:id_doc' : 'id_doc'
    }
    ... (there are 6 orderings of time, c_type, id_user)
}
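
As a rough sketch of the write side of this layout (Python, with hypothetical helper names; the row naming follows the docs_shared_* pattern above), one share event fans out into one column per ordering:

```python
from itertools import permutations

FIELDS = ("time", "c_type", "id_user")


def index_entries(event):
    """Return {index_row_name: column_name} for one share event, one per ordering."""
    entries = {}
    for order in permutations(FIELDS):
        row_name = "docs_shared_" + "_".join(order)
        # The column name is the ordered field values, then id_doc to keep it unique.
        parts = [str(event[field]) for field in order] + [str(event["id_doc"])]
        entries[row_name] = ":".join(parts)
    return entries


event = {"time": 123456, "c_type": "BOT", "id_user": 123, "id_doc": 123}
entries = index_entries(event)

for row_name, column_name in sorted(entries.items()):
    print(row_name, "->", column_name)
```

Three fields give 3! = 6 orderings, so every share costs six index writes on top of the write to the doc itself.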

That way I can list with keystart and keyend and filters.

Example :

No filter : index -> time:c_type:id_user
Filter on c_type :  index -> c_type:time:id_user
Filter on id_user :  index -> id_user:time:c_type
Filter on c_type and id_user : index -> id_user:c_type:time

Fortunately Cassandra likes writing!!! (irony inside)


So I have a question: I've read that secondary indexes on SC -> C may
arrive in a future release... Is this true? And is it already
planned?


Thank you,

Sébastien,

2011/3/2 Burc Sade <bu...@gmail.com>

> You can use PHP Solr Extension. It is a fully featured and light-weight
> client.
>
> http://www.php.net/manual/en/book.solr.php
>
> Without the secondary indexes on columns in CFs within SCFs, the best
> approach at the moment is to create query-specific CFs. In the end it all
> comes down to how simple you can make your queries to keep the CF count low.
>
> Regards,
> Burc

Re: Advice on a design

Posted by Jeremy Hanna <je...@gmail.com>.
Have you considered using Solandra (Solr/Lucene + Cassandra)? See https://github.com/tjake/Lucandra#readme. There is also a #solandra channel on freenode if you have any questions.

On Mar 3, 2011, at 8:00 AM, Vodnok wrote:

> OK, it seems that I'll use Solr (with a dedicated Cassandra) for search.
> 
> I've read this article on RP vs OPP: http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/


Re: Advice on a design

Posted by Burc Sade <bu...@gmail.com>.
You can use the PHP Solr extension. It is a fully featured and lightweight
client.

http://www.php.net/manual/en/book.solr.php

Without the secondary indexes on columns in CFs within SCFs, the best
approach at the moment is to create query-specific CFs. In the end it all
comes down to how simple you can make your queries to keep the CF count low.
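
As a sketch of what a query-specific CF means in practice, here is the doc_types layout quoted in this thread modelled with plain Python dicts. DocStore and its methods are hypothetical names, and a real client would issue the two writes against Cassandra rather than against in-memory dicts.

```python
from collections import defaultdict


class DocStore:
    """In-memory stand-in for a 'docs' CF plus a query-specific 'doc_types' CF."""

    def __init__(self):
        self.docs = {}                       # doc_id -> columns
        self.doc_types = defaultdict(dict)   # 'type:code' -> {doc_id: creation_date}

    def save_doc(self, doc_id, c_type, c_bot_code, creation_date):
        # Write the doc itself, then the index row the query will read.
        self.docs[doc_id] = {"c_type": c_type, "c_bot_code": c_bot_code}
        self.doc_types[f"{c_type}:{c_bot_code}"][doc_id] = creation_date

    def docs_by_type_and_code(self, c_type, c_bot_code):
        # The whole query is a single row read on the query-specific CF.
        return sorted(self.doc_types.get(f"{c_type}:{c_bot_code}", {}))


store = DocStore()
store.save_doc("123456", "BOT", "ABC", "20110301")
store.save_doc("223456", "BOT", "ABC", "20110302")
store.save_doc("323456", "BOT", "XYZ", "20110302")

matches = store.docs_by_type_and_code("BOT", "ABC")
```

The trade-off is the usual one: every query shape you want to serve this way adds one more write per document.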

Regards,
Burc

On Wed, Mar 2, 2011 at 9:06 AM, Vodnok <vo...@gmail.com> wrote:

> I also think it'll be easier via Solr. I just need to google it (if you
> have links about Solr in PHP...).
>
> I realize that I have to remove some dimensions from my CF...
>
> I thought it was possible to have SCF -> CF -> SC -> C:value with a
> secondary index on C but, as I understand it, a secondary index on C in a
> super column is not possible for now (but maybe will be in a future version).
> As I understand it, it's better to have more, simpler CFs than fewer, more
> complex CFs.
>
> Thank you for your reply,

Re: Advice on a design

Posted by Vodnok <vo...@gmail.com>.
I also think it will be easier with Solr. I just need to google it (if you
have links about using Solr from PHP...).

I realize that I have to remove some dimensions from my CFs...

I thought it was possible to have SCF -> CF -> SC -> C:value with a
secondary index on C, but as I understand it, a secondary index on a column
inside a super column is not possible for now (but maybe it will be in a
future version).
As I understand it, it's better to have more, less complex CFs than fewer,
more complex CFs.
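
To make "removing a dimension" concrete, here is a rough sketch in plain Python (no Cassandra client involved). It folds the super column name into each column name, so a row of the 'docs' design becomes a single flat level of columns. The helper name flatten_row and the ':' separator are my own assumptions, not something from phpcassa or Cassandra.

```python
# Hypothetical sketch: flatten a super-column style row into one level
# by folding the super column name into each column name.

def flatten_row(super_row, sep=":"):
    """Turn {super_column: {column: value}} into {"super_column:column": value}."""
    flat = {}
    for sc_name, columns in super_row.items():
        for col_name, value in columns.items():
            flat[f"{sc_name}{sep}{col_name}"] = value
    return flat

# One row of the 'docs' design from the original post (abridged).
doc_row = {
    "t_info": {"c_type": "BOT", "b_del": "y", "b_publish": "y"},
    "t_info_type": {"l_title": "Hello World!", "c_bot_code": "ABC"},
}

flat_row = flatten_row(doc_row)
```

After this, 'c_type' lives in an ordinary column ('t_info:c_type') of a standard CF instead of inside a super column.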

Thank you for your reply,



2011/3/2 Burc Sade <bu...@gmail.com>

> Hi Vodnok,
>
> For tag searches I would use a search engine like Solr (Lucene), as I think
> it would be more flexible to query. You can update the index as new data
> comes in and query it for queries #1, #2 and #4.
>
> For "All doc of type='BOT' and c_bot_code='ABC'" query, I would create the
> CF below.
>
> doc_types
> {
>    'BOT:ABC':
>   {
>     <docid>: <creation_date?>
>   }
> }
>
> You can store, as the value for each docid, whatever you are going to need
> after querying. The problem with this schema is that if there are not many
> type:c_bot_code combinations, there will be many columns under each key in
> this CF. If one combination has many more columns than the others, a hot
> spot problem may arise.
>

Re: Advice on a design

Posted by Burc Sade <bu...@gmail.com>.
Hi Vodnok,

For tag searches I would use a search engine like Solr (Lucene), as I think
it would be more flexible to query. You can update the index as new data
comes in and query it for queries #1, #2 and #4.
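
To show the idea only (this is not Solr's API), here is a tiny in-memory sketch in Python of what such an index does for the multiple-tags query: it keeps a tag -> doc ids map and intersects the sets. The names tag_index, index_doc and docs_with_all_tags are made up for the illustration.

```python
from collections import defaultdict

# Hypothetical in-memory inverted index: tag -> set of doc ids.
tag_index = defaultdict(set)

def index_doc(doc_id, tags):
    """Register a doc under each of its tags (call this as new data comes in)."""
    for tag in tags:
        tag_index[tag].add(doc_id)

def docs_with_all_tags(tags):
    """Return the ids of docs that carry every tag in `tags`."""
    sets = [tag_index.get(tag, set()) for tag in tags]
    if not sets:
        return set()
    return set.intersection(*sets)

# Doc ids and tags taken from the example design.
index_doc("123456", ["fire", "belgium"])
index_doc("223456", ["fire"])

both = docs_with_all_tags(["fire", "belgium"])
```

A real search engine adds ranking, range filters (useful for the ratio in query #4) and persistence on top of this basic structure.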

For "All doc of type='BOT' and c_bot_code='ABC'" query, I would create the
CF below.

doc_types
{
   'BOT:ABC':
  {
    <docid>: <creation_date?>
  }
}

You can store, as the value for each docid, whatever you are going to need
after querying. The problem with this schema is that if there are not many
type:c_bot_code combinations, there will be many columns under each key in
this CF. If one combination has many more columns than the others, a hot
spot problem may arise.
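
A minimal sketch of the doc_types CF above, modelled with plain Python dicts rather than a real Cassandra client. The helper names (row_key, add_doc, docs_for, row_sizes) and the sample dates are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical model of doc_types: row key "type:code" -> {doc_id: creation_date}.
doc_types = defaultdict(dict)

def row_key(c_type, c_bot_code):
    """Build the composite row key, e.g. 'BOT:ABC'."""
    return f"{c_type}:{c_bot_code}"

def add_doc(c_type, c_bot_code, doc_id, creation_date):
    """Write one column (doc_id -> creation_date) under the composite key."""
    doc_types[row_key(c_type, c_bot_code)][doc_id] = creation_date

def docs_for(c_type, c_bot_code):
    """Answer "all docs of type X and c_bot_code Y" with a single row lookup."""
    return dict(doc_types.get(row_key(c_type, c_bot_code), {}))

def row_sizes():
    """Column count per row key; one very large row hints at a hot spot."""
    return {key: len(cols) for key, cols in doc_types.items()}

# Sample data: doc ids from the example design, dates invented.
add_doc("BOT", "ABC", "123456", "20110301")
add_doc("BOT", "ABC", "223456", "20110302")
add_doc("BOT", "XYZ", "323456", "20110302")

abc_docs = docs_for("BOT", "ABC")
```

Checking row_sizes() from time to time is one simple way to notice when a single type:c_bot_code combination starts to dominate.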


