Posted to user@cassandra.apache.org by Artur Siekielski <ar...@vhex.net> on 2015/09/11 09:53:17 UTC

Tag filtering data model

I store documents submitted by users, with optional tags (lists of strings):

CREATE TABLE doc (
   user_id uuid,
   date text, // part of partition key, to distribute data better
   doc_id uuid,
   tags list<text>,
   contents text,
   PRIMARY KEY((user_id, date), doc_id)
);

What is the best way to implement tag filtering? A user can select a 
list of tags and get back the documents that carry all of those tags. I 
thought about:

1) Full denormalization - include the tags in the primary key and insert 
a copy of the doc for each subset of its tags. This, however, leads to 
large disk usage, because there are 2**n subsets (for 10 tags that is 
1024 copies, so roughly 1 GB written for a 1 MB doc).
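To put numbers on option 1, a quick back-of-the-envelope sketch in Python (the function names are mine, for illustration only):

```python
# Cost sketch for full denormalization: a document with n tags is
# rewritten once per tag subset, so storage grows by a factor of 2**n.

def denormalized_copies(num_tags: int) -> int:
    """Rows written for one document with num_tags tags (all subsets)."""
    return 2 ** num_tags

def storage_mb(num_tags: int, doc_size_mb: float) -> float:
    """Total bytes (in MB) written for one doc under full denormalization."""
    return denormalized_copies(num_tags) * doc_size_mb

print(storage_mb(10, 1.0))  # → 1024.0 (MB, for a 1 MB doc with 10 tags)
```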

2) A secondary index on the 'tags' collection, with queries like:
SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS ? AND 
tags CONTAINS ? ...

Since I will supply the partition key, I assume there will be no 
problem with contacting multiple nodes. But how well will it work for 
hundreds of thousands of results? I think the intersection of tag 
matches has to be computed in memory, so it will not scale well.
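For reference, option 2 assumes a collection index exists (e.g. CREATE INDEX ON doc (tags)); the query with the CONTAINS clauses can be sketched like this (the builder function is just an illustration):

```python
# Sketch of the option 2 query string: one "tags CONTAINS ?" predicate
# per requested tag, all ANDed together against a single partition.

def build_tag_query(num_tags: int) -> str:
    """Return a parameterized CQL SELECT with num_tags CONTAINS clauses."""
    contains = " AND ".join(["tags CONTAINS ?"] * num_tags)
    return f"SELECT * FROM doc WHERE user_id=? AND date=? AND {contains}"

print(build_tag_query(2))
# → SELECT * FROM doc WHERE user_id=? AND date=? AND tags CONTAINS ? AND tags CONTAINS ?
```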

3) Partial denormalization - insert one row per (doc, tag) and compute 
the intersection manually. However, in the worst case this can lead to 
scanning almost the whole table.
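A sketch of the client-side intersection for option 3 (Python; the per-tag doc_id sets are assumed to come from one query per tag against a doc-by-tag lookup table):

```python
# Client-side intersection: fetch the doc_id set for each requested tag,
# then intersect, starting from the smallest set so the working set
# shrinks as early as possible.

def intersect_tag_matches(doc_ids_per_tag):
    """doc_ids_per_tag: iterable of sets, one per requested tag."""
    sets = sorted((set(s) for s in doc_ids_per_tag), key=len)
    if not sets:
        return set()
    result = sets[0]
    for s in sets[1:]:
        result &= s
        if not result:  # no doc can match all tags; stop early
            break
    return result

print(intersect_tag_matches([{"a", "b", "c"}, {"b", "c"}, {"c", "d"}]))  # → {'c'}
```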

4) Full denormalization, but without contents. I would get the matching 
doc_ids fast, but then I would need a '... WHERE doc_id IN ?' query with 
a potentially very large list of doc_ids.
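For option 4, one common mitigation for a huge IN list is to chunk the doc_id list and issue one query per chunk (ideally in parallel). A minimal sketch, with an arbitrary chunk size:

```python
# Split a very large doc_id list into IN-clause sized chunks; each chunk
# becomes one SELECT. The chunk size of 100 is an arbitrary illustration,
# not a recommendation.

def chunked(ids, size=100):
    """Yield successive fixed-size slices of a doc_id list."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

doc_ids = [f"id{n}" for n in range(250)]
batches = list(chunked(doc_ids))
print(len(batches), [len(b) for b in batches])  # → 3 [100, 100, 50]
```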


What's Cassandra's way to implement this?

Re: Tag filtering data model

Posted by Artur Siekielski <ar...@vhex.net>.
I came to a similar conclusion: once you have more than a few tags, the 
problem is no longer simple "tagging" but closer to regular "document 
search" with indexed words. There are too many word subsets to 
precompute the matching documents, so you need to index documents 
individually and compute intersections dynamically. And for acceptable 
performance the indexes must be held fully in memory, in data structures 
that allow computing intersections fast. This is not something regular 
databases implement (though they can be used as backing storage for 
indexes loaded into memory).
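A toy version of the in-memory structure described above - an inverted index mapping tag -> set of doc_ids, with intersections computed per query (illustration only; a real one would be rebuilt from the backing store on startup):

```python
# Minimal in-memory inverted index: tag -> set of doc_ids. Queries for
# multiple tags intersect the per-tag sets dynamically.
from collections import defaultdict

class TagIndex:
    def __init__(self):
        self._index = defaultdict(set)

    def add(self, doc_id, tags):
        """Index one document under each of its tags."""
        for tag in tags:
            self._index[tag].add(doc_id)

    def search(self, tags):
        """Return doc_ids carrying *all* of the given tags."""
        sets = [self._index.get(t, set()) for t in tags]
        if not sets:
            return set()
        return set.intersection(*sets)

idx = TagIndex()
idx.add("d1", ["python", "db"])
idx.add("d2", ["python"])
print(idx.search(["python", "db"]))  # → {'d1'}
```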

So the solution is to either limit the number of tags to 3-4 and do 
full denormalization (up to an 8-16x duplication factor), or use a 
search engine.
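For the small-tag-budget variant, the denormalized rows can be keyed by the sorted tag subset, so any exact tag combination becomes a single partition lookup. A sketch of the key generation (the '|' separator is my assumption):

```python
# Generate one lookup key per non-empty tag subset. Sorting first makes
# the key deterministic regardless of the order tags were supplied in.
from itertools import combinations

def subset_keys(tags):
    """All non-empty sorted subsets of tags, each joined into one key."""
    tags = sorted(tags)
    keys = []
    for r in range(1, len(tags) + 1):
        for combo in combinations(tags, r):
            keys.append("|".join(combo))
    return keys

print(subset_keys(["b", "a"]))  # → ['a', 'b', 'a|b']
```

With 4 tags this yields 15 keys per document, matching the duplication factor above.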

On 09/16/2015 11:29 AM, Naresh Yadav wrote:


Re: Tag filtering data model

Posted by Naresh Yadav <ny...@gmail.com>.
We also had a similar use case. After a lot of trials with Cassandra, we
finally created a Solr schema - doc_id (unique key), tags (indexed) - in
Apache Solr to answer the search query "get me the matching docs for any
given set of tags", and that solved our use case. We had millions of docs,
and a doc can have hundreds of tags.

Please share your final conclusion if you crack this problem within
Cassandra only - I would be interested to know your solution.
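For what it's worth, the Solr side of this is a single query that ANDs the tag terms. A sketch of building such a request URL (the core name 'docs' and the field names are assumptions):

```python
# Build a Solr select URL that requires all given tags. Only the query
# string is constructed here; actually sending it needs a running Solr.
from urllib.parse import urlencode

def solr_tag_query(tags, rows=10):
    """Return a /select URL matching docs that carry every tag."""
    q = " AND ".join(f'tags:"{t}"' for t in tags)
    return "/solr/docs/select?" + urlencode({"q": q, "rows": rows})

print(solr_tag_query(["billing", "urgent"]))
```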

On Fri, Sep 11, 2015 at 1:23 PM, Artur Siekielski <ar...@vhex.net> wrote:


Re: Tag filtering data model

Posted by Carlos Alonso <in...@mrcalonso.com>.
Really interesting question, Artur. Have you gone any further?

I think, based on my experience and Cassandra's good practices, that 
full denormalisation is the Cassandra way to go.

Cheers

Carlos Alonso | Software Engineer | @calonso <https://twitter.com/calonso>

On 11 September 2015 at 08:53, Artur Siekielski <ar...@vhex.net> wrote:
