Posted to user@cassandra.apache.org by Donald Smith <Do...@audiencescience.com> on 2014/08/30 04:09:40 UTC

Rebuilding a cassandra seed node with the same tokens and same IP address

One of our nodes is getting an increasing number of pending compactions due, we think, to https://issues.apache.org/jira/browse/CASSANDRA-7145, which is fixed in the upcoming 2.0.11 release. (We had the same error a month ago, but at that time we were in pre-production and could just clean the disks on all the nodes and restart. Now we want to be cleverer.)


To overcome the issue, we figure we should rebuild the node with the same token range, to avoid unneeded data reshuffling. The plan: (1) find the tokens in use on that node via "nodetool ring", (2) stop cassandra on that node, (3) delete the data directory, (4) use the tokens saved in step (1) as the initial_token list, and (5) restart the node.
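
Concretely, something like this (addresses and data paths are illustrative, and the token extraction assumes the stock "nodetool ring" layout with the token in the last column):

    # (1) save this node's tokens as a comma-separated list
    nodetool ring | grep 10.0.0.5 | awk '{print $NF}' | paste -s -d, - > /tmp/tokens.txt

    # (2) stop cassandra, then (3) clear its data
    sudo service cassandra stop
    sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*

    # (4) pin the saved tokens in cassandra.yaml before restarting:
    #     initial_token: <comma-separated list from /tmp/tokens.txt>

    # (5) bring the node back
    sudo service cassandra start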


But the node is a seed node and cassandra won't bootstrap seed nodes. Perhaps removing that node's address from the seeds list on the other nodes (and on that node) will be sufficient. That's what Replacing a Dead Seed Node<http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_seed_node.html> suggests. Perhaps I can remove the IP address from the seeds list on all nodes in the cluster, restart all the nodes, and then restart the bad node with auto_bootstrap=true.


I want to use the same IP address, so I don't think I can follow the instructions at http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html, because they assume the IP addresses of the dead node and the new node differ.


If I just start it up, it will start serving traffic and read requests will fail. It wouldn't be the end of the world (the production use isn't critical yet).


Should we use "nodetool rebuild $LOCAL_DC"? (Though I think that's mostly for adding a data center.) Should I add the node back in and do "nodetool repair"? I'm afraid that would be too slow.


Again, I don't want to REMOVE the node from the cluster: that would cause reshuffling of token ranges and data. I want to keep the same token range.


Any suggestions?


Thanks, Don

Re: Rebuilding a cassandra seed node with the same tokens and same IP address

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, Aug 29, 2014 at 7:09 PM, Donald Smith <
Donald.Smith@audiencescience.com> wrote:

>  But the node is a seed node and cassandra won't bootstrap seed nodes.
> Perhaps removing that node's address from the seeds list on the other nodes
> (and on that node) will be sufficient. That's what Replacing a Dead Seed
> Node
> <http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_seed_node.html>
> suggests. Perhaps I can remove the ip address from the seeds list on all
> nodes in the cluster, restart all the nodes, and then restart the bad node
> with auto_bootstrap=true.
>
Just temporarily remove it from its own seed list and use
replace_address with auto_bootstrap=true. You need replace_address to
bootstrap the node into the range it already owns.
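
Concretely, on 2.0 that means dropping the node's own IP from the seeds
line in its cassandra.yaml, then starting it once with the replace_address
system property pointing at its own address (addresses illustrative):

    # cassandra-env.sh on the node being rebuilt
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=10.0.0.5"

    # cassandra.yaml on the same node: list only the *other* seeds
    #   - seeds: "10.0.0.1,10.0.0.2"

Remove the property again once bootstrap completes, and restore the node
to its own seed list if it should remain a seed.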

The fact that you don't have to remove it from the other nodes' seed lists
suggests that there is something fundamentally confused about the "seed
nodes cannot bootstrap" implementation detail.

https://issues.apache.org/jira/browse/CASSANDRA-5836?focusedCommentId=13727032&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13727032

=Rob

Re: Machine Learning With Cassandra

Posted by Peter Lin <wo...@gmail.com>.
There are other machine learning frameworks that scale better than Hadoop +
Mahout:

http://hunch.net/~vw/

If the kind of machine learning you're doing is really large and speed
matters, take a look at Vowpal Wabbit.




On Sat, Aug 30, 2014 at 4:58 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>  Ahh thanks. Yeah my searches for “machine learning with Cassandra” were
> not turning up much useful stuff.

Re: Machine Learning With Cassandra

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Ahh thanks. Yeah my searches for “machine learning with Cassandra” were not turning up much useful stuff.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData


Re: Machine Learning With Cassandra

Posted by James Horey <jl...@opencore.io>.
If you want distributed machine learning, you can use either Mahout (runs on Hadoop) or Spark (MLLib). If you choose the Hadoop route, Datastax provides a connector (CFS) to interact with data stored in Cassandra. Otherwise you can try to use the Cassandra InputFormat (not as simple, but plenty of people use it). 

A quick search for “map reduce cassandra” on this list brings up a recent conversation: http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E 

If you prefer to use Spark, you can try the Datastax Cassandra connector: https://github.com/datastax/spark-cassandra-connector. This should let you run Spark jobs that read data from, and write results back to, Cassandra.

Cheers, 
James

Web: http://ferry.opencore.io
Twitter: @open_core_io


Re: Machine Learning With Cassandra

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Yes I remember this conversation. That was when I was just first stepping into this stuff. My current understanding is:
Storm = Stream and micro batch
Spark = Batch and micro batch

Micro batching is what gets you to exactly-once processing semantics. I’m clear on that. What I’m not clear on is how and where processing takes place.

I also get the fact that Spark is a faster execution engine than MapReduce. But we have Tez now... except, as far as I know, that’s not useful here because my data isn’t in HDFS. People seem to be talking quite a bit about Mahout and Spark Shell, but I’d really like to get this done with a minimum amount of software; either Storm or Spark but not both.

Trident ML isn’t distributed, which is fine because I’m not trying to do learning on the stream. For now, I’m just trying to do learning in batch and then update parameters as suggested earlier.

Let me simplify the question. How do I do distributed machine learning when my data is in Cassandra and not HDFS? I haven’t totally explored Mahout yet, but a lot of the algorithms run on MapReduce, which is fine for now. As I understand it though, MapReduce works on data in HDFS, correct?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData


Re: Machine Learning With Cassandra

Posted by Shahab Yunus <sh...@gmail.com>.
Spark is not storage; rather, it is a distributed data-processing framework
meant to run on big data architectures (a very high-level intro/definition).
It provides batched versions of in-memory map/reduce-like jobs. It is not
fully streaming like Storm; instead it batches collections of tuples, and
thus you can run complex ML algorithms relatively fast.

I think we just discussed this a short while ago when a similar question
(Storm vs. Spark, I think) was raised by you earlier. Here is the link to
that discussion:
http://markmail.org/message/lc4icuw4hobul6oh


Regards,
Shahab



Re: Machine Learning With Cassandra

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say load it “into” Spark. Is Spark separate storage?

B.


Re: Machine Learning With Cassandra

Posted by Alex Kamil <al...@gmail.com>.
Adaryl,

Most ML algorithms are based on some form of numerical optimization, using
something like online gradient descent
<http://en.wikipedia.org/wiki/Stochastic_gradient_descent> or conjugate
gradient
<http://www.math.buffalo.edu/~pitman/courses/cor502/odes/node4.html> (e.g.
in SVM classifiers). In its simplest form it is a nested FOR loop where on
each iteration you update the weights or parameters of the model until
reaching some convergence threshold that minimizes the prediction error
(usually the goal is to minimize a loss function
<http://en.wikipedia.org/wiki/Loss_function>, as in the popular least squares
<http://en.wikipedia.org/wiki/Least_squares> technique). You could
parallelize this loop using a brute-force divide-and-conquer approach:
map a chunk of data to each node and compute a partial sum there, then
aggregate the results from each node into a global sum in a 'reduce' stage,
and repeat this map-reduce cycle until convergence. You can look up
distributed gradient descent
<http://scholar.google.com/scholar?hl=en&q=gradient+descent+with+map-reduc>
or check out Mahout
<https://mahout.apache.org/users/recommender/matrix-factorization.html>
or Spark MLlib <https://spark.apache.org/docs/latest/mllib-guide.html>
for examples. Alternatively you can use something like GraphLab
<http://graphlab.com/products/create/docs/graphlab.toolkits.recommender.html>.
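
A toy sketch of that loop in Scala, with plain in-memory collections standing
in for the per-node chunks (purely illustrative, not tied to any framework):

    // 'map' step: squared-loss gradient over one node's chunk of (features, label) pairs
    def partialGradient(chunk: Seq[(Array[Double], Double)], w: Array[Double]): Array[Double] =
      chunk.map { case (x, y) =>
        val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - y // prediction error
        x.map(_ * err)                                              // per-example gradient
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })

    // repeat map-reduce rounds until the iteration budget (or a convergence test) is hit
    def train(chunks: Seq[Seq[(Array[Double], Double)]], dim: Int,
              lr: Double = 0.01, iters: Int = 100): Array[Double] = {
      var w = Array.fill(dim)(0.0)
      val n = chunks.map(_.size).sum.toDouble
      for (_ <- 1 to iters) {
        // 'reduce' step: aggregate the per-chunk partial sums into a global gradient
        val grad = chunks.map(partialGradient(_, w))
                         .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
        w = w.zip(grad).map { case (wi, gi) => wi - lr * gi / n }
      }
      w
    }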

Cassandra can serve as a data store from which you load the training data,
e.g. into Spark using this connector
<https://github.com/datastax/spark-cassandra-connector>, and then train the
model using MLlib or Mahout (it has Spark bindings, I believe). Once you
have trained the model, you can save the parameters back to Cassandra. The
next stage is using the model to classify new data, e.g. to recommend
similar items based on a log of new purchases; there you could once again
use Spark, or Storm with something like this
<https://github.com/pmerienne/trident-ml>.
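
To make that pipeline concrete, a rough Scala sketch against the Datastax
connector and MLlib; the keyspace, table, and column names are invented for
illustration:

    import com.datastax.spark.connector._
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("train-from-cassandra")
      .set("spark.cassandra.connection.host", "10.0.0.1") // any Cassandra node
    val sc = new SparkContext(conf)

    // load training rows from Cassandra into an RDD of labeled examples
    val training = sc.cassandraTable("ml", "purchases").map { row =>
      LabeledPoint(row.getDouble("label"),
                   Vectors.dense(row.getDouble("f1"), row.getDouble("f2")))
    }

    // train in parallel with MLlib's SGD-based linear regression (100 iterations)
    val model = LinearRegressionWithSGD.train(training, 100)

    // save the learned parameters back to Cassandra for the serving stage
    sc.parallelize(Seq(("model-v1", model.weights.toArray.mkString(","))))
      .saveToCassandra("ml", "models", SomeColumns("id", "weights"))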

Alex





Machine Learning With Cassandra

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
I’m planning to speak at a local meet-up and I need to know if what I have in my head is even possible.

I want to give an example of working with data in Cassandra. I have data coming in through Kafka and Storm and I’m saving it off to Cassandra (this is only on paper at this point). I then want to run an ML algorithm over the data. My problem here is, while my data is distributed, I don’t know how to do the analysis in a distributed manner. I could certainly use R but processing the data on a single machine would seem to defeat the purpose of all this scalability.

What is my solution?
B.