Posted to user@cassandra.apache.org by "Hiller, Dean" <De...@nrel.gov> on 2012/10/02 15:33:08 UTC

Re: 1000's of CF's. virtual CFs possible Map/Reduce SOLUTION...

Well, I think I know the direction we may follow so we can
1. Have Virtual CF's
2. Be able to map/reduce ONE Virtual CF

Well, not map/reduce exactly but really really close.  We use PlayOrm with
its partitioning, so I am now thinking what we will do is have a compute
grid where each node does a findAll query into the partitions it is
responsible for.  In this way, I think we can have 1000's of virtual CF's
inside ONE CF, and then PlayOrm does its query and retrieves the rows for
that partition of one virtual CF.
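
A minimal sketch of that idea, under stated assumptions: one physical CF is
modelled as an in-memory map with sorted row keys, and each row key is
prefixed with a virtual CF name and a partition. The class OneBigCF, the
"<virtual_cf>:<partition>:<row_id>" key layout, and partitions_for_node are
all hypothetical names for illustration. This is not PlayOrm's actual API or
storage layout.

```python
from bisect import bisect_left


class OneBigCF:
    """Hypothetical in-memory stand-in for ONE physical CF with sorted keys."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> row value

    def put(self, virtual_cf, partition, row_id, value):
        # Assumed key layout: "<virtual_cf>:<partition>:<row_id>"
        key = f"{virtual_cf}:{partition}:{row_id}"
        if key not in self._rows:
            self._keys.insert(bisect_left(self._keys, key), key)
        self._rows[key] = value

    def find_all(self, virtual_cf, partition):
        """Return only the rows of one partition of one virtual CF."""
        prefix = f"{virtual_cf}:{partition}:"
        i = bisect_left(self._keys, prefix)
        out = []
        # Keys sharing the prefix are contiguous, so other virtual CFs
        # are never touched by this scan.
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append(self._rows[self._keys[i]])
            i += 1
        return out


def partitions_for_node(node_index, node_count, partitions):
    """Round-robin assignment of partitions to compute nodes."""
    return [p for i, p in enumerate(partitions) if i % node_count == node_index]


cf = OneBigCF()
for vcf in ("sensorA", "sensorB"):
    for part in ("2012-09", "2012-10"):
        for n in range(3):
            cf.put(vcf, part, n, f"{vcf}/{part}/{n}")

# Two-node "grid": each node scans only its own partitions of sensorA.
parts = ["2012-09", "2012-10"]
results = {
    node: [row
           for p in partitions_for_node(node, 2, parts)
           for row in cf.find_all("sensorA", p)]
    for node in range(2)
}
```

Each node ends up with only the sensorA rows from the partitions assigned to
it, and the sensorB rows are never read. The sketch relies on sorted keys,
which a real cluster may not provide for row keys.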

Anyone know of a compute grid we can dish out work to?  That would be my
only missing piece (well, that and the PlayOrm virtual CF feature, but I
can probably add that within a week, though I am on vacation this Thursday
to Monday).

Later,
Dean

On 10/2/12 6:35 AM, "Hiller, Dean" <De...@nrel.gov> wrote:

>So basically, with moving towards the 1000's of CF's all being put in one
>CF, our performance is going to tank on map/reduce, correct?  I mean, from
>what I remember we could do map/reduce on a single CF, but by stuffing
>1000's of virtual CF's into one CF, our map/reduce will have to read in
>all 999 virtual CF's rows that we don't want just to map/reduce the ONE
>CF.
>
>Map/reduce is VERY VERY SLOW when reading in 1000 times more rows :( :(.
>
>Is this correct?  This really sounds like highly undesirable behavior.
>There needs to be a way for people with 1000's of CF's to also run
>map/reduce on any one CF.  Doing map/reduce on 1000 times the number of
>rows will be 1000 times slower...and of course, we will most likely get up
>to 20,000 tables from my most recent projections...in our last test load, we
>ended up with 8k+ CF's.  Since I kept two other keyspaces, Cassandra
>started getting really REALLY slow when we got up to 15k+ CF's in the
>system, though I didn't look into why.
>
>I don't mind having 1000's of virtual CF's in ONE CF, BUT I need to
>map/reduce "just" the virtual CF!!!!!  Ugh.
>
>Thanks,
>Dean
>
>On 10/1/12 3:38 PM, "Ben Hood" <0x...@gmail.com> wrote:
>
>>On Mon, Oct 1, 2012 at 9:38 PM, Brian O'Neill <bo...@alumni.brown.edu>
>>wrote:
>>> It's just a convenient way of prefixing:
>>> 
>>>http://hector-client.github.com/hector/build/html/content/virtual_keyspaces.html
>>
>>So given that it is possible to use a CF per tenant, should we assume
>>that at sufficient scale there is less overhead to prefix keys than
>>there is to manage multiple CFs?
>>
>>Ben
>

Re: 1000's of CF's. virtual CFs possible Map/Reduce SOLUTION...

Posted by Brian O'Neill <bo...@gmail.com>.

Dean,

We moved away from Hadoop and M/R, and instead we are using Storm as our
compute grid.  We queue keys in Kafka, then Storm distributes the work to
the grid.  It's working well so far, but we haven't taken it to prod yet.
Data is read from Cassandra using a Cassandra-bolt.
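
The queue-then-distribute pattern described above can be sketched with the
Python standard library alone. This is only an illustration of the shape of
the pipeline, not Storm or Kafka code: queue.Queue stands in for the Kafka
topic of keys, threads stand in for the grid workers, and the STORE dict is
a hypothetical stand-in for the rows read from Cassandra.

```python
import queue
import threading

# Hypothetical stand-in for rows stored in Cassandra.
STORE = {f"key{i}": i * 10 for i in range(20)}


def worker(work_queue, results, lock):
    """Pull keys until a None sentinel arrives, read each row, record it."""
    while True:
        key = work_queue.get()
        if key is None:
            break
        value = STORE[key]  # stand-in for the Cassandra read
        with lock:
            results[key] = value


def run_grid(keys, worker_count=4):
    """Queue every key, let the workers drain the queue, return what they read."""
    work_queue = queue.Queue()
    results, lock = {}, threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(work_queue, results, lock))
        for _ in range(worker_count)
    ]
    for t in threads:
        t.start()
    for key in keys:
        work_queue.put(key)
    for _ in threads:
        work_queue.put(None)  # one sentinel per worker so each one stops
    for t in threads:
        t.join()
    return results


processed = run_grid(sorted(STORE))
```

Because only the keys of interest are queued, the workers never read rows
outside the set being processed, which is the property the thread is after.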

If you end up using Storm, let me know.  We have an unreleased version of
the bolt that you probably want to use.  (we're waiting on Nathan/Storm to
fix some classpath loading issues)

RE: a custom virtual keyspace Partitioner, point well taken

-brian

---
Brian O'Neill
Lead Architect, Software Development
 
Health Market Science
The Science of Better Results
2700 Horizon Drive • King of Prussia, PA • 19406
M: 215.588.6024 • @boneill42 <http://www.twitter.com/boneill42> •
healthmarketscience.com
