Posted to dev@cassandra.apache.org by Mehar Chaitanya <me...@gmail.com> on 2010/01/29 13:16:24 UTC

Is this possible with cassandra

Hi All

I am a J2EE programmer, and my only database experience is with SQL: I
write a query and get the result back.

How can I use Cassandra for my requirement? Is it possible?

Below is my scenario:

   - I have a table with columns such as
   Category_name, Section_name, article, is_published_by, holding multiple
   records.
   - I want to retrieve records based on a condition, such as belonging to
   category_name 'X'.
   - The same applies to the other columns: conditions based on Section and
   is_published_by.


Please let me know if it would be possible.

Thanks&Regards,
Mehar Chaitanya Bandaru,
Software Engineer,
S cubes IT Solutions India Pvt. Ltd.,
http://www.scubian.com
(W) +91 4040307821,
(Cell) +91 9440 999 262,
#4-1-319, 2nd Floor, Abids Road, Hyderabad - 01.

Re: Is this possible with cassandra

Posted by Nathan McCall <na...@vervewireless.com>.
Do not be afraid to duplicate data - the articles section of the
Cassandra wiki has some good use cases regarding this:

http://wiki.apache.org/cassandra/ArticlesAndPresentations

I found the following two helpful in getting my head around structuring my data:

http://about.digg.com/blog/looking-future-cassandra
http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model

We are working with cassandra for similar news-article ingestion and
storage and it works quite well given the speed of write throughput
and ease of scale and capacity planning.

Cheers,
-Nate


There are more linked off of the articles section

On Fri, Jan 29, 2010 at 7:09 AM, Mehar Chaitanya
<me...@gmail.com> wrote:
> Hi Jonathan
>
> Thanks for your reply. I have gone through the URL that you specified.
> Let me state my problem more clearly:
>
> We have an RDBMS table that contains Category ID, Section ID, Article, and
> IS_Published columns. The application that we currently have uses SQL and
> gets the data in various forms, e.g. get all the articles that belong to a
> section, or get all the articles that belong to a specific category and a
> specific section and are published, and so on.
>
> With your example, I understand that it is possible for me to have multiple
> columnfamilies and store the same data e.g:
>
> keyspace.category[WORLDNEWS][SECTION] = HOCKEY
> keyspace.category[WORLDNEWS][ARTICLE] = World cup hockey matches begin...
> keyspace.category[WORLDNEWS][IS_PUBLISHED] = TRUE
>
> and another set as
> keyspace.section[HOCKEY][CATEGORY] = WORLDNEWS
> keyspace.section[HOCKEY][ARTICLE] = World cup hockey matches begin...
> keyspace.section[HOCKEY][IS_PUBLISHED] = TRUE
>
> Now, if the above example is correct then I have the following questions:
>
>   1. This would lead to an enormous amount of duplication of data; in short,
>   if I now want to view the data from the IS_PUBLISHED dimension then my database
>   size would scale up tremendously.
>   2. The above way of representing the data would suffice if I want to retrieve
>   something like: get me all the articles whose category is WORLDNEWS. But
>   what if I want to find out something like: get me all the articles whose
>   Section is BASEBALL and Category is WORLDNEWS? For addressing queries that
>   depend on multiple parameters, how do we do it? Hope I am clear with my
>   problem statement :(
>
> Please help me out in understanding this basic difference between
> interpreting data in RDBMS world v/s NRDBMS world.
>
>
> On Fri, Jan 29, 2010 at 8:00 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>
>> Cassandra does not support ad-hoc queries the way SQL does.  If you
>> want to ask "what rows have a column X containing value Y" then you
>> need to create a columnfamily whose keys are the values of X, and
>> whose columns are the keys of your original CF.
>>
>> Read http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model if
>> you haven't yet.
>>
>> On Fri, Jan 29, 2010 at 6:16 AM, Mehar Chaitanya
>> <me...@gmail.com> wrote:
>> > Hi All
>> >
>> > I am a J2EE programmer, and my only database experience is with SQL: I
>> > write a query and get the result back.
>> >
>> > How can I use Cassandra for my requirement? Is it possible?
>> >
>> > Below is my scenario:
>> >
>> >   - I have a table with columns such as
>> >   Category_name, Section_name, article, is_published_by, holding
>> >   multiple records.
>> >   - I want to retrieve records based on a condition, such as belonging to
>> >   category_name 'X'.
>> >   - The same applies to the other columns: conditions based on Section and
>> >   is_published_by.
>> >
>> >
>> > Please let me know if it would be possible.
>> >
>> > Thanks&Regards,
>> > Mehar Chaitanya Bandaru,
>> > Software Engineer,
>> > S cubes IT Solutions India Pvt. Ltd.,
>> > http://www.scubian.com
>> > (W) +91 4040307821,
>> > (Cell) +91 9440 999 262,
>> > #4-1-319, 2nd Floor, Abids Road, Hyderabad - 01.
>> >
>>
>

Re: bitmap slices

Posted by Jesse McConnell <je...@gmail.com>.
predicates for values would be nice, > < = and others would be quite useful

jesse

--
jesse mcconnell
jesse.mcconnell@gmail.com



On Mon, Feb 1, 2010 at 10:41, Jonathan Ellis <jb...@gmail.com> wrote:
> I don't think this is very useful for column names.  I could see it
> being useful for values but if we're going to add predicate queries
> then I'd rather do something more general.
>
> 2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
>> On Mon, 1 Feb 2010 09:42:16 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
>>
>> JE> 2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
>>>> On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote:
>>>>
>> TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
>> JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
>> JE> <me...@gmail.com> wrote:
>>>>>>>   1. This would lead to an enormous amount of duplication of data; in short,
>>>>>>>   if I now want to view the data from the IS_PUBLISHED dimension then my database
>>>>>>>   size would scale up tremendously.
>>>>
>> JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
>> JE> other things fast.
>>>>
>> TZ> IIUC, Mehar would be duplicating the article data for every article tag.
>>>>
>> TZ> I searched the bug tracker and wiki and didn't find anything on the
>> TZ> topic of tag storage and search, so I don't think Cassandra supports
>> TZ> tags without data duplication.
>>>>
>> TZ> Would it be possible to implement an optional byte[] bitmap field in
>> TZ> SliceRange?  If you can specify the bitmap as an optional field it would
>> TZ> not break current clients.  Then the search can return only the subset
>> TZ> of the range that matches the bitmap.  This would make sense for
>> TZ> BytesType and LongType, at least.
>>>>
>>>> I looked at the source code and it seems that
>>>> StorageProxy::getSliceRange() is the focal point for reads and bitmap
>>>> matching should be implemented there.  The bitmap could be applied as a
>>>> filter before the other SliceRange parameters, especially the max number
>>>> of return results.  It may be worth the effort to send the bitmap down
>>>> to the ReadCommand/ColumnFamily level to reduce the number of potential
>>>> matches.
>>>>
>>>> If this is not feasible for technical reasons I'd like to know.
>>>> Otherwise I'll put it on my TODO list and produce a proposal (unless
>>>> someone more knowledgeable is interested, of course).
>>
>> JE> how would this be different than the byte[] column name you can
>> JE> already match on?
>>
>> Given byte columns
>>
>> A 0110
>> B 0111
>> C 0101
>>
>> the bitmask approach would let you specify a bitmask of "0011" and get
>> only B.  It's just an AND that looks for a non-zero value.  So you can
>> say "0111" and get A, B, and C.  Or "0010" to get A and B.  "1000" gets
>> nothing.
>>
>> Cassandra could support OR-ed multiples for better queries, so you could
>> ask for (0001,0010) to get A, B, and C.
>>
>> Ted
>>
>>
>

Re: bitmap slices

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Thu, 4 Feb 2010 09:39:55 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/4 Ted Zlatanov <tz...@lifelogs.com>:
JE> The mask check needs to be done in the Slice Filter, not SP.
>> 
>> Sorry, I don't know what you mean.  Are you referring to
>> o.a.c.db.filter.SliceQueryFilter?  So I'd just add an extra parameter to
>> the constructor and change the matching logic?

JE> Right, but make it optional.

JE> All right, let's give it a try.

I created http://issues.apache.org/jira/browse/CASSANDRA-764 and put up
an unfinished patch to do the API support, with a request for advice.

I am still learning the internals so I'll take time to implement the
SliceQueryFilter matching.  If anyone more knowledgeable wants to do
that part, it would be appreciated and I can learn from it :)

Thanks
Ted


Re: bitmap slices

Posted by Jonathan Ellis <jb...@gmail.com>.
2010/2/4 Ted Zlatanov <tz...@lifelogs.com>:
> JE> The mask check needs to be done in the Slice Filter, not SP.
>
> Sorry, I don't know what you mean.  Are you referring to
> o.a.c.db.filter.SliceQueryFilter?  So I'd just add an extra parameter to
> the constructor and change the matching logic?

Right, but make it optional.

> JE> Is this actually powerful enough to solve a real problem for you?
>
> Yes!  OR+AND are exactly what I need.
>
> One specific situation: a supercolumn holds byte[] keys representing
> network addresses (IPv4, IPv6, and Infiniband).  I want to do efficient
> queries across them by various netmasks; the netmasks are not trivial
> and need the OR+AND structure.  Right now I do it all on the client
> side.  I can't break things down by key or by supercolumn further
> because I already use the supercolumn as a time (Long) index and the key
> represents the specific colo.

All right, let's give it a try.

Re: bitmap slices

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Wed, 3 Feb 2010 17:06:32 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> It seems to me that the bitmask is only really useful for the
JE> SliceRange predicate.  Doing a predicate of "fetch these column names,
JE> but only if they match this mask" seems strange.

You're right, I put it at the wrong level.

JE> The mask check needs to be done in the Slice Filter, not SP.

Sorry, I don't know what you mean.  Are you referring to
o.a.c.db.filter.SliceQueryFilter?  So I'd just add an extra parameter to
the constructor and change the matching logic?  Or should I derive a new
class for backwards compatibility?

I also realized I should have just used a BitSet to do the matching
logic for me.

JE> Is this actually powerful enough to solve a real problem for you?

Yes!  OR+AND are exactly what I need.

One specific situation: a supercolumn holds byte[] keys representing
network addresses (IPv4, IPv6, and Infiniband).  I want to do efficient
queries across them by various netmasks; the netmasks are not trivial
and need the OR+AND structure.  Right now I do it all on the client
side.  I can't break things down by key or by supercolumn further
because I already use the supercolumn as a time (Long) index and the key
represents the specific colo.
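
A rough sketch of the kind of client-side filtering described above, in Python. It follows one possible reading of "OR+AND": a key matches if, for at least one rule, the key ANDed with the netmask equals the network prefix. The function names, the rule layout, and the sample addresses are made up for illustration and are not part of any Cassandra API.

```python
def in_network(address, network, netmask):
    """AND step: every masked byte of the address equals the network byte."""
    if not (len(address) == len(network) == len(netmask)):
        return False  # e.g. an IPv6 key tested against an IPv4 rule
    return all((a & m) == n for a, m, n in zip(address, netmask, network))


def filter_keys(keys, rules):
    """OR step: keep keys that satisfy at least one (network, netmask) rule."""
    return [k for k in keys
            if any(in_network(k, net, mask) for net, mask in rules)]


# Hypothetical byte[] keys as they might come back from a full slice.
keys = [
    bytes([10, 1, 2, 3]),
    bytes([10, 9, 0, 1]),
    bytes([192, 168, 5, 7]),
    bytes([172, 16, 0, 9]),
]
rules = [
    (bytes([10, 1, 0, 0]), bytes([255, 255, 0, 0])),
    (bytes([192, 168, 0, 0]), bytes([255, 255, 0, 0])),
]
matched = filter_keys(keys, rules)
```

Here `matched` keeps the 10.1.2.3 and 192.168.5.7 keys. A server-side mask would apply the same test before the columns are sent, which is the saving being asked for.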

I have other examples but they'd require too much explanation of my
company's infrastructure.

I've seen at least 4 real examples in the last 2 weeks on the dev and
user Cassandra mailing lists where some kind of data tagging could be
useful instead of just the range query.  The original article example
that started this thread, for instance, could have used that approach to
tag the articles and reduce the need for data duplication.

Thanks
Ted


Re: bitmap slices

Posted by Jonathan Ellis <jb...@gmail.com>.
It seems to me that the bitmask is only really useful for the
SliceRange predicate.  Doing a predicate of "fetch these column names,
but only if they match this mask" seems strange.

The mask check needs to be done in the Slice Filter, not SP.

Is this actually powerful enough to solve a real problem for you?

-Jonathan

2010/2/3 Ted Zlatanov <tz...@lifelogs.com>:
> On Mon, 1 Feb 2010 11:14:12 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
>
> JE> 2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
>>> On Mon, 1 Feb 2010 10:41:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
>>>
> JE> I don't think this is very useful for column names.  I could see it
> JE> being useful for values but if we're going to add predicate queries
> JE> then I'd rather do something more general.
>>>
>>> Do you have any ideas?
>
> JE> Not really, no.  I think we're best served developing feature X by
> JE> starting with problems that can only be solved with X and working from
> JE> there.  Going the other direction is asking for trouble.
>
> I looked at the filters, e.g. o.a.c.db.filter.SliceQueryFilter, and it
> seems like one place to put predicate logic is in that hierarchy.
> Perhaps there can be a PredicateQueryFilter.  Some thought has
> apparently already gone into flexible filters at the storage level.  I
> hope something happens in this direction but I won't push for it
> further since it's not what I need.
>
> The attached patch is how I propose to do bitmasks inside the
> SlicePredicate.  As you suggested, it solves the specific problem.  It's
> pretty simple and carries no performance penalty if bitmasks are not
> used.  It's untested, intended to show the interface and approach I am
> proposing.  I didn't open an issue since it's unclear that this is the
> way to go.
>
> Thanks
> Ted
>
>

Re: predicate queries (was: bitmap slices)

Posted by Ryan Daum <ry...@thimbleware.com>.
On Mon, Feb 1, 2010 at 1:55 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> 2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:

> here's the thing, though, and this is a main reason why cassandra
> keeps things simple: doing work on the client typically results in
> *less* load on the server.

I agree with this in general, but while I'm not sure what it would
look like, having some support which would aid the construction of
secondary bitmap indexes could be very interesting.

>
> I think this is a non-starter.  It's pretty clear that the way forward
> is towards more programmatic apis, not clients translating their
> requests to strings which the server then parses.

+1
> -Jonathan
>

Re: predicate queries (was: bitmap slices)

Posted by Jonathan Ellis <jb...@gmail.com>.
2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
> My list of things I need for predicate queries across column and
> supercolumn names:
>
> - bitmask (OR AND1 AND2 AND3 ...).  This would make my life easier and
>  take load off our Cassandra servers.  Currently I have to scan the
>  result sets on the client side to find the things I need.

here's the thing, though, and this is a main reason why cassandra
keeps things simple: doing work on the client typically results in
*less* load on the server.

> FWIW I'd like an entirely text-based query language like SQL

I think this is a non-starter.  It's pretty clear that the way forward
is towards more programmatic apis, not clients translating their
requests to strings which the server then parses.

-Jonathan

predicate queries (was: bitmap slices)

Posted by Ted Zlatanov <tz...@lifelogs.com>.
My list of things I need for predicate queries across column and
supercolumn names:

- bitmask (OR AND1 AND2 AND3 ...).  This would make my life easier and
  take load off our Cassandra servers.  Currently I have to scan the
  result sets on the client side to find the things I need.

- date matching on Long supercolumns (find the last Friday, for example;
  the Calendar fields could be a good query language).  This is not so
  important.

FWIW I'd like an entirely text-based query language like SQL with the
whole query in one string, accessible at the top keyspace level and not
requiring other binary types.  Then you don't have to worry about
overengineering it one way or another, it simply supports whatever is
necessary.  You've already got some of the parser written for the
cassandra-cli stuff.

Another approach is to implement a JDBC driver with very limited
query functionality.

Ted


Re: bitmap slices

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Mon, 1 Feb 2010 12:55:01 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
>> My list of things I need for predicate queries across column and
>> supercolumn names:
>> 
>> - bitmask (OR AND1 AND2 AND3 ...).  This would make my life easier and
>>  take load off our Cassandra servers.  Currently I have to scan the
>>  result sets on the client side to find the things I need.

JE> here's the thing, though, and this is a main reason why cassandra
JE> keeps things simple: doing work on the client typically results in
JE> *less* load on the server.

I'm OK with loading the server a little more if it means the client gets
20 instead of 20K results.  Those that don't want to use this can do a
regular query, filter client-side, and keep load off the server.

Ted


Re: bitmap slices

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Mon, 1 Feb 2010 11:14:12 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
>> On Mon, 1 Feb 2010 10:41:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
>> 
JE> I don't think this is very useful for column names.  I could see it
JE> being useful for values but if we're going to add predicate queries
JE> then I'd rather do something more general.
>> 
>> Do you have any ideas?

JE> Not really, no.  I think we're best served developing feature X by
JE> starting with problems that can only be solved with X and working from
JE> there.  Going the other direction is asking for trouble.

I looked at the filters, e.g. o.a.c.db.filter.SliceQueryFilter, and it
seems like one place to put predicate logic is in that hierarchy.
Perhaps there can be a PredicateQueryFilter.  Some thought has
apparently already gone into flexible filters at the storage level.  I
hope something happens in this direction but I won't push for it
further since it's not what I need.

The attached patch is how I propose to do bitmasks inside the
SlicePredicate.  As you suggested, it solves the specific problem.  It's
pretty simple and carries no performance penalty if bitmasks are not
used.  It's untested, intended to show the interface and approach I am
proposing.  I didn't open an issue since it's unclear that this is the
way to go.

Thanks
Ted


Re: bitmap slices

Posted by Jonathan Ellis <jb...@gmail.com>.
2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
> On Mon, 1 Feb 2010 10:41:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
>
> JE> I don't think this is very useful for column names.  I could see it
> JE> being useful for values but if we're going to add predicate queries
> JE> then I'd rather do something more general.
>
> Do you have any ideas?

Not really, no.  I think we're best served developing feature X by
starting with problems that can only be solved with X and working from
there.  Going the other direction is asking for trouble.

-Jonathan

Re: bitmap slices

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Mon, 1 Feb 2010 10:41:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> I don't think this is very useful for column names.  I could see it
JE> being useful for values but if we're going to add predicate queries
JE> then I'd rather do something more general.

Do you have any ideas?  Are you thinking of a general query language
expressed as nested data items or of a string query parser that produces
a parse tree on the backend?  What capabilities will these queries have?

Ted


Re: bitmap slices

Posted by Jonathan Ellis <jb...@gmail.com>.
I don't think this is very useful for column names.  I could see it
being useful for values but if we're going to add predicate queries
then I'd rather do something more general.

2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
> On Mon, 1 Feb 2010 09:42:16 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
>
> JE> 2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
>>> On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote:
>>>
> TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
> JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
> JE> <me...@gmail.com> wrote:
>>>>>>   1. This would lead to an enormous amount of duplication of data; in short,
>>>>>>   if I now want to view the data from the IS_PUBLISHED dimension then my database
>>>>>>   size would scale up tremendously.
>>>
> JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
> JE> other things fast.
>>>
> TZ> IIUC, Mehar would be duplicating the article data for every article tag.
>>>
> TZ> I searched the bug tracker and wiki and didn't find anything on the
> TZ> topic of tag storage and search, so I don't think Cassandra supports
> TZ> tags without data duplication.
>>>
> TZ> Would it be possible to implement an optional byte[] bitmap field in
> TZ> SliceRange?  If you can specify the bitmap as an optional field it would
> TZ> not break current clients.  Then the search can return only the subset
> TZ> of the range that matches the bitmap.  This would make sense for
> TZ> BytesType and LongType, at least.
>>>
>>> I looked at the source code and it seems that
>>> StorageProxy::getSliceRange() is the focal point for reads and bitmap
>>> matching should be implemented there.  The bitmap could be applied as a
>>> filter before the other SliceRange parameters, especially the max number
>>> of return results.  It may be worth the effort to send the bitmap down
>>> to the ReadCommand/ColumnFamily level to reduce the number of potential
>>> matches.
>>>
>>> If this is not feasible for technical reasons I'd like to know.
>>> Otherwise I'll put it on my TODO list and produce a proposal (unless
>>> someone more knowledgeable is interested, of course).
>
> JE> how would this be different than the byte[] column name you can
> JE> already match on?
>
> Given byte columns
>
> A 0110
> B 0111
> C 0101
>
> the bitmask approach would let you specify a bitmask of "0011" and get
> only B.  It's just an AND that looks for a non-zero value.  So you can
> say "0111" and get A, B, and C.  Or "0010" to get A and B.  "1000" gets
> nothing.
>
> Cassandra could support OR-ed multiples for better queries, so you could
> ask for (0001,0010) to get A, B, and C.
>
> Ted
>
>

Re: bitmap slices

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Mon, 1 Feb 2010 09:42:16 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> 2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
>> On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote:
>> 
TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
JE> <me...@gmail.com> wrote:
>>>>>   1. This would lead to an enormous amount of duplication of data; in short,
>>>>>   if I now want to view the data from the IS_PUBLISHED dimension then my database
>>>>>   size would scale up tremendously.
>> 
JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
JE> other things fast.
>> 
TZ> IIUC, Mehar would be duplicating the article data for every article tag.
>> 
TZ> I searched the bug tracker and wiki and didn't find anything on the
TZ> topic of tag storage and search, so I don't think Cassandra supports
TZ> tags without data duplication.
>> 
TZ> Would it be possible to implement an optional byte[] bitmap field in
TZ> SliceRange?  If you can specify the bitmap as an optional field it would
TZ> not break current clients.  Then the search can return only the subset
TZ> of the range that matches the bitmap.  This would make sense for
TZ> BytesType and LongType, at least.
>> 
>> I looked at the source code and it seems that
>> StorageProxy::getSliceRange() is the focal point for reads and bitmap
>> matching should be implemented there.  The bitmap could be applied as a
>> filter before the other SliceRange parameters, especially the max number
>> of return results.  It may be worth the effort to send the bitmap down
>> to the ReadCommand/ColumnFamily level to reduce the number of potential
>> matches.
>> 
>> If this is not feasible for technical reasons I'd like to know.
>> Otherwise I'll put it on my TODO list and produce a proposal (unless
>> someone more knowledgeable is interested, of course).

JE> how would this be different than the byte[] column name you can
JE> already match on?

Given byte columns

A 0110
B 0111
C 0101

the bitmask approach would let you specify a bitmask of "0011" and get
only B.  It's just an AND that looks for a non-zero value.  So you can
say "0111" and get A, B, and C.  Or "0010" to get A and B.  "1000" gets
nothing.

Cassandra could support OR-ed multiples for better queries, so you could
ask for (0001,0010) to get A, B, and C.
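
The matching rule can be checked with a few lines of Python. This is only a sketch on small integers, not the proposed SliceRange field, and `match_any` and `match_all` are names made up here. Note that with the non-zero test, "0011" also matches A and C; getting only B for "0011" needs the stricter test that every bit of the mask is set.

```python
columns = {"A": 0b0110, "B": 0b0111, "C": 0b0101}


def match_any(value, masks):
    """Non-zero test: value shares at least one bit with any of the masks."""
    return any(value & mask for mask in masks)


def match_all(value, mask):
    """Stricter test: every bit of the mask is set in value."""
    return (value & mask) == mask


def select(predicate):
    """Names of the columns whose value satisfies the predicate, sorted."""
    return sorted(name for name, value in columns.items() if predicate(value))


non_zero_0111 = select(lambda v: match_any(v, [0b0111]))        # A, B, C
non_zero_0010 = select(lambda v: match_any(v, [0b0010]))        # A, B
non_zero_1000 = select(lambda v: match_any(v, [0b1000]))        # nothing
or_ed = select(lambda v: match_any(v, [0b0001, 0b0010]))        # A, B, C
non_zero_0011 = select(lambda v: match_any(v, [0b0011]))        # A, B, C
all_bits_0011 = select(lambda v: match_all(v, 0b0011))          # only B
```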

Ted


Re: bitmap slices

Posted by Jonathan Ellis <jb...@gmail.com>.
how would this be different than the byte[] column name you can
already match on?

2010/2/1 Ted Zlatanov <tz...@lifelogs.com>:
> On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote:
>
> TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote:
> JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
> JE> <me...@gmail.com> wrote:
>>>>   1. This would lead to an enormous amount of duplication of data; in short,
>>>>   if I now want to view the data from the IS_PUBLISHED dimension then my database
>>>>   size would scale up tremendously.
>
> JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
> JE> other things fast.
>
> TZ> IIUC, Mehar would be duplicating the article data for every article tag.
>
> TZ> I searched the bug tracker and wiki and didn't find anything on the
> TZ> topic of tag storage and search, so I don't think Cassandra supports
> TZ> tags without data duplication.
>
> TZ> Would it be possible to implement an optional byte[] bitmap field in
> TZ> SliceRange?  If you can specify the bitmap as an optional field it would
> TZ> not break current clients.  Then the search can return only the subset
> TZ> of the range that matches the bitmap.  This would make sense for
> TZ> BytesType and LongType, at least.
>
> I looked at the source code and it seems that
> StorageProxy::getSliceRange() is the focal point for reads and bitmap
> matching should be implemented there.  The bitmap could be applied as a
> filter before the other SliceRange parameters, especially the max number
> of return results.  It may be worth the effort to send the bitmap down
> to the ReadCommand/ColumnFamily level to reduce the number of potential
> matches.
>
> If this is not feasible for technical reasons I'd like to know.
> Otherwise I'll put it on my TODO list and produce a proposal (unless
> someone more knowledgeable is interested, of course).
>
> Ted
>
>

Re: bitmap slices

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov <tz...@lifelogs.com> wrote: 

TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 
JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
JE> <me...@gmail.com> wrote:
>>>   1. This would lead to an enormous amount of duplication of data; in short,
>>>   if I now want to view the data from the IS_PUBLISHED dimension then my database
>>>   size would scale up tremendously.

JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
JE> other things fast.

TZ> IIUC, Mehar would be duplicating the article data for every article tag.

TZ> I searched the bug tracker and wiki and didn't find anything on the
TZ> topic of tag storage and search, so I don't think Cassandra supports
TZ> tags without data duplication.

TZ> Would it be possible to implement an optional byte[] bitmap field in
TZ> SliceRange?  If you can specify the bitmap as an optional field it would
TZ> not break current clients.  Then the search can return only the subset
TZ> of the range that matches the bitmap.  This would make sense for
TZ> BytesType and LongType, at least.

I looked at the source code and it seems that
StorageProxy::getSliceRange() is the focal point for reads and bitmap
matching should be implemented there.  The bitmap could be applied as a
filter before the other SliceRange parameters, especially the max number
of return results.  It may be worth the effort to send the bitmap down
to the ReadCommand/ColumnFamily level to reduce the number of potential
matches.
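
Why the order matters can be seen in a small Python model. It is a toy, not the StorageProxy code: `get_slice` here is an illustrative function over an in-memory list, column names are assumed to be integers, and a name matches when it shares at least one bit with the mask.

```python
def get_slice(columns, start, finish, count, bitmask=None):
    """Return up to `count` (name, value) pairs with start <= name <= finish.

    `columns` is a list of (name, value) pairs sorted by name. When `bitmask`
    is given, names sharing no bit with it are skipped before the count
    limit is applied.
    """
    result = []
    for name, value in columns:
        if len(result) >= count:
            break
        if name < start or name > finish:
            continue
        if bitmask is not None and not (name & bitmask):
            continue
        result.append((name, value))
    return result


row = [(n, "v%d" % n) for n in range(1, 9)]  # column names 1..8

plain = get_slice(row, 1, 8, 3)
masked = get_slice(row, 1, 8, 3, bitmask=0b100)
```

`plain` holds names 1, 2, 3, while `masked` holds 4, 5, 6. Had the limit of 3 been applied first and the mask afterwards, the masked query would have come back empty, since none of 1, 2, 3 has that bit set.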

If this is not feasible for technical reasons I'd like to know.
Otherwise I'll put it on my TODO list and produce a proposal (unless
someone more knowledgeable is interested, of course).

Ted


bitmap slices (was: Is this possible with cassandra)

Posted by Ted Zlatanov <tz...@lifelogs.com>.
On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis <jb...@gmail.com> wrote: 

JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
JE> <me...@gmail.com> wrote:
>>   1. This would lead to an enormous amount of duplication of data; in short,
>>   if I now want to view the data from the IS_PUBLISHED dimension then my database
>>   size would scale up tremendously.

JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
JE> other things fast.

IIUC, Mehar would be duplicating the article data for every article tag.

I searched the bug tracker and wiki and didn't find anything on the
topic of tag storage and search, so I don't think Cassandra supports
tags without data duplication.

Would it be possible to implement an optional byte[] bitmap field in
SliceRange?  If you can specify the bitmap as an optional field it would
not break current clients.  Then the search can return only the subset
of the range that matches the bitmap.  This would make sense for
BytesType and LongType, at least.

I think this would be useful to me and perhaps to Mehar and others.  In
my setup, at least, I have many tags on my resources and a single tag
supercolumn per resource would be nice.

Ted


Re: Is this possible with cassandra

Posted by Mehar Chaitanya <me...@gmail.com>.
Hi Jonathan

I have gone through Lucandra, but I am still unclear about retrieving based
on multiple conditions.

In the previous example,


keyspace.category[WORLDNEWS][SECTION] = HOCKEY
keyspace.category[WORLDNEWS][ARTICLE] = World cup hockey matches begin...
keyspace.category[WORLDNEWS][IS_PUBLISHED] = TRUE

keyspace.category[WORLDNEWS][SECTION] = TENNIS
keyspace.category[WORLDNEWS][ARTICLE] = Australian Open...
keyspace.category[WORLDNEWS][IS_PUBLISHED] = TRUE


keyspace.category[WORLDNEWS][SECTION] = CRICKET
keyspace.category[WORLDNEWS][ARTICLE] = IPL 2010
keyspace.category[WORLDNEWS][IS_PUBLISHED] = FALSE


and another set as
keyspace.section[HOCKEY][CATEGORY] = WORLDNEWS
keyspace.section[HOCKEY][ARTICLE] = World cup hockey matches begin...
keyspace.section[HOCKEY][IS_PUBLISHED] = TRUE


keyspace.section[TENNIS][CATEGORY] = WORLDNEWS
keyspace.section[TENNIS][ARTICLE] = Australian Open...
keyspace.section[TENNIS][IS_PUBLISHED] = TRUE

keyspace.section[CRICKET][CATEGORY] = WORLDNEWS
keyspace.section[CRICKET][ARTICLE] = IPL 2010
keyspace.section[CRICKET][IS_PUBLISHED] = FALSE

I have two questions to ask.

*FIRST ONE*

Suppose I want to retrieve all the values for which IS_PUBLISHED is TRUE and
SECTION = CRICKET.

In the above case, do I have to join the two column families? Or do I have
to run two queries, one for IS_PUBLISHED = TRUE and another for
SECTION = CRICKET, and then build another set from the two results and
retrieve that?
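
The second option is close to the advice quoted further down in this message: intersect on the client. A minimal sketch of that idea in Python, with plain dicts standing in for column families. This is not the Cassandra client API; the names `articles`, `by_section`, `by_published` and the keys `a1` to `a3` are made up for illustration. Each index row is keyed by a value, and its columns are the row keys of the main column family.

```python
# Main column family: one row per article, keyed by a made-up article id.
articles = {
    "a1": {"CATEGORY": "WORLDNEWS", "SECTION": "HOCKEY",
           "ARTICLE": "World cup hockey matches begin...",
           "IS_PUBLISHED": "TRUE"},
    "a2": {"CATEGORY": "WORLDNEWS", "SECTION": "TENNIS",
           "ARTICLE": "Australian Open...", "IS_PUBLISHED": "TRUE"},
    "a3": {"CATEGORY": "WORLDNEWS", "SECTION": "CRICKET",
           "ARTICLE": "IPL 2010", "IS_PUBLISHED": "FALSE"},
}


def build_index(rows, column):
    """Index column family: row key = a value of `column`,
    column names = keys of the rows holding that value."""
    index = {}
    for key, columns in rows.items():
        index.setdefault(columns[column], set()).add(key)
    return index


by_section = build_index(articles, "SECTION")
by_published = build_index(articles, "IS_PUBLISHED")


def find(section, published):
    """One lookup per index, then intersect the key sets on the client."""
    keys = by_section.get(section, set()) & by_published.get(published, set())
    return sorted(keys)
```

With this sample data, `find("CRICKET", "TRUE")` is empty because IPL 2010 is not yet published, and `find("TENNIS", "TRUE")` gives `['a2']`. Only the keys are duplicated in the index rows, not the article text.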


*SECOND ONE*

Suppose that after some days the data changes: IPL 2010 is also
published, so its IS_PUBLISHED value becomes TRUE.

I read on a blog (http://www.vineetgupta.com/) that there are only INSERTs,
with no UPDATE or DELETE.

So what should be done with the old value?
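
A note on this question: each Cassandra column carries a timestamp, and inserting a column that already exists replaces the stored value when the new timestamp is higher, so an insert also serves as an update. A toy model of that rule in Python follows. It is an in-memory dict, not the real client, and `insert` and `get` are illustrative helpers rather than Thrift calls.

```python
# row key -> column name -> (value, timestamp)
store = {}


def insert(key, column, value, timestamp):
    """Write a column; the value with the highest timestamp wins."""
    row = store.setdefault(key, {})
    current = row.get(column)
    if current is None or timestamp > current[1]:
        row[column] = (value, timestamp)


def get(key, column):
    """Return the current value of a column."""
    return store[key][column][0]


insert("CRICKET", "IS_PUBLISHED", "FALSE", timestamp=1)
insert("CRICKET", "IS_PUBLISHED", "TRUE", timestamp=2)   # replaces the old value
insert("CRICKET", "IS_PUBLISHED", "FALSE", timestamp=1)  # stale write, ignored
```

After these three writes `get("CRICKET", "IS_PUBLISHED")` is `"TRUE"`. If a separate index row per IS_PUBLISHED value is kept, the application also has to move the article key from the FALSE row to the TRUE row itself.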

Correct me if I am going in the wrong direction.

Sorry for my English.


On Fri, Jan 29, 2010 at 11:36 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
> <me...@gmail.com> wrote:
> >   1. This would lead to an enormous amount of data duplication. In short,
> >   if I now also want to view the data by the IS_PUBLISHED dimension, my
> >   database size would grow tremendously.
>
> Yes.  But disk space is so cheap it's worth using a lot of it to make
> other things fast.
>
> >   2. The above way of representing the data would suffice if I want to
> >   retrieve something like "get me all the articles whose category is
> >   WORLDNEWS". But what if I want something like "get me all the articles
> >   whose Section is BASEBALL and Category is WORLDNEWS"? How do we handle
> >   queries that depend on multiple parameters? I hope my problem statement
> >   is clear :(
>
> You have to do the intersection client-side (or use something like
> http://github.com/tjake/Lucandra to do it for you).
>
> -Jonathan
>




Re: Is this possible with cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
<me...@gmail.com> wrote:
>   1. This would lead to an enormous amount of data duplication. In short,
>   if I now also want to view the data by the IS_PUBLISHED dimension, my
>   database size would grow tremendously.

Yes.  But disk space is so cheap it's worth using a lot of it to make
other things fast.

>   2. The above way of representing the data would suffice if I want to
>   retrieve something like "get me all the articles whose category is
>   WORLDNEWS". But what if I want something like "get me all the articles
>   whose Section is BASEBALL and Category is WORLDNEWS"? How do we handle
>   queries that depend on multiple parameters? I hope my problem statement
>   is clear :(

You have to do the intersection client-side (or use something like
http://github.com/tjake/Lucandra to do it for you).
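
A minimal sketch of what "intersection client-side" could look like, in
Python. The three dictionaries are hypothetical in-memory stand-ins for index
column families (one row per indexed value, holding the matching article
keys), and the article keys "a1".."a4" are invented for illustration. A real
client would fetch each index row from Cassandra instead.

```python
# Hypothetical index rows: indexed value -> set of article keys.
# In a real client each lookup would be one row read from an index CF.
by_category = {"WORLDNEWS": {"a1", "a2", "a3"}, "LOCALNEWS": {"a4"}}
by_section = {"HOCKEY": {"a1"}, "TENNIS": {"a2"}, "CRICKET": {"a3", "a4"}}
by_published = {True: {"a1", "a2", "a4"}, False: {"a3"}}


def intersect(*key_sets):
    """Return the sorted keys present in every set, starting from the smallest."""
    if not key_sets:
        return []
    ordered = sorted(key_sets, key=len)
    result = set(ordered[0])
    for keys in ordered[1:]:
        result &= keys
    return sorted(result)


def find(category=None, section=None, published=None):
    """Read one index row per supplied condition and intersect the key sets."""
    key_sets = []
    if category is not None:
        key_sets.append(by_category.get(category, set()))
    if section is not None:
        key_sets.append(by_section.get(section, set()))
    if published is not None:
        key_sets.append(by_published.get(published, set()))
    return intersect(*key_sets)
```

For example, find(category="WORLDNEWS", section="CRICKET") reads two index
rows and keeps only the keys that appear in both. The matching articles are
then fetched by key from the original column family.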

-Jonathan

Re: Is this possible with cassandra

Posted by Mehar Chaitanya <me...@gmail.com>.
Hi Jonathan

Thanks for your reply. I have gone through the URL that you specified.
Let me state my problem more clearly:

We have an RDBMS table that contains Category ID, Section ID, Article, and
IS_Published columns. The application that we currently have uses SQL and
gets the data in various forms, e.g. get all the articles that belong to a
section, get all the articles that belong to a specific category and a
specific section and are published, and so on.

With your example, I understand that it is possible for me to have multiple
columnfamilies and store the same data, e.g.:

keyspace.category[WORLDNEWS][SECTION] = HOCKEY
keyspace.category[WORLDNEWS][ARTICLE] = World cup hockey matches begin...
keyspace.category[WORLDNEWS][IS_PUBLISHED] = TRUE

and another set as
keyspace.section[HOCKEY][CATEGORY] = WORLDNEWS
keyspace.section[HOCKEY][ARTICLE] = World cup hockey matches begin...
keyspace.section[HOCKEY][IS_PUBLISHED] = TRUE

Now, if the above example is correct, then I have the following questions:

   1. This would lead to an enormous amount of data duplication. In short,
   if I now also want to view the data by the IS_PUBLISHED dimension, my
   database size would grow tremendously.
   2. The above way of representing the data would suffice if I want to
   retrieve something like "get me all the articles whose category is
   WORLDNEWS". But what if I want something like "get me all the articles
   whose Section is BASEBALL and Category is WORLDNEWS"? How do we handle
   queries that depend on multiple parameters? I hope my problem statement
   is clear :(

Please help me understand this basic difference between
interpreting data in the RDBMS world v/s the NRDBMS world.


On Fri, Jan 29, 2010 at 8:00 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> Cassandra does not support ad-hoc queries the way SQL does.  If you
> want to ask "what rows have a column X containing value Y" then you
> need to create a columnfamily whose keys are the values of X, and
> whose columns are the keys of your original CF.
>
> Read http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model if
> you haven't yet.

Re: Is this possible with cassandra

Posted by Jonathan Ellis <jb...@gmail.com>.
Cassandra does not support ad-hoc queries the way SQL does.  If you
want to ask "what rows have a column X containing value Y" then you
need to create a columnfamily whose keys are the values of X, and
whose columns are the keys of your original CF.
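
As a rough illustration of that layout, here is a sketch in Python that uses
plain dictionaries as stand-ins for column families (row key -> columns). The
names articles, by_section and insert_article, and the article keys "a1".."a3",
are hypothetical and not part of any Cassandra API. The point is only that the
index row is keyed by the value of SECTION and its column names are the row
keys of the original CF.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for two column families:
# row key -> {column name -> value}.
articles = {}                    # original CF, keyed by article key
by_section = defaultdict(dict)   # index CF, keyed by the value of SECTION


def insert_article(article_key, category, section, article, is_published):
    """Write the article row, then record its key as a column in the index CF."""
    articles[article_key] = {
        "CATEGORY": category,
        "SECTION": section,
        "ARTICLE": article,
        "IS_PUBLISHED": is_published,
    }
    # The column name is the original row key; the column value is unused.
    by_section[section][article_key] = ""


def articles_in_section(section):
    """Answer 'which rows have SECTION == section' by reading one index row."""
    return [articles[key] for key in sorted(by_section.get(section, {}))]


insert_article("a1", "WORLDNEWS", "HOCKEY", "World cup hockey matches begin...", True)
insert_article("a2", "WORLDNEWS", "TENNIS", "Australian Open...", True)
insert_article("a3", "WORLDNEWS", "CRICKET", "IPL 2010", False)
```

The application has to maintain the index itself: every write to the original
CF is paired with a write to each index CF it wants to query by.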

Read http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model if
you haven't yet.
