Posted to hdfs-user@hadoop.apache.org by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com> on 2014/07/02 21:31:52 UTC

Spark vs. Storm

Do these two projects do essentially the same thing? Is one better than the other?

Re: Big Data tech stack (was Spark vs. Storm)

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
From a high level that makes sense, but I have no idea how you’d implement that. You can’t have expertise in everything. For example, Mongo and Cassandra are both complex databases that require a decent amount of knowledge to run properly, but they are entirely different in the way they operate and store data. If you’re an employee who is going to stick around and manage whatever you build, and your company is going to let you learn as you go, that would be fine. If you’re dropping in as a consultant and the company expects you to know your stuff from day one, it seems like you’re saying that you’re going to be limited in the engagements you can take.

From: Stephen Boesch 
Sent: Wednesday, July 02, 2014 4:23 PM
To: user@hadoop.apache.org 
Subject: Re: Big Data tech stack (was Spark vs. Storm)

You will not arrive at a generic stack without oversimplifying to the point of serious deficiencies. There are, as you say, a multitude of options. You are attempting to boil them down to "A vs. B" as opposed to "A may work better under the following conditions ..."




2014-07-02 13:25 GMT-07:00 Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com>:

  You know what I’m really trying to do? I’m trying to come up with a best practice technology stack. There are so many freaking projects it is overwhelming. If I were to walk into an organization that had no Big Data capability, what mix of projects would be best to implement based on performance, scalability and ease of use/implementation? So far I’ve got:
  Ubuntu
  Hadoop
  Cassandra (Seems to be the highest performing NoSQL database out there.)
  Storm (maybe?)
  Python (Easier than Java. Maybe that shouldn’t be a concern.)
  Hive (For people to leverage their existing SQL skillset.)

  That would seem to cover transaction processing and warehouse storage and the capability to do batch and real time analysis. What am I leaving out or what do I have incorrect in my assumptions?

  B.



  From: Stephen Boesch 
  Sent: Wednesday, July 02, 2014 3:07 PM
  To: user@hadoop.apache.org 
  Subject: Re: Spark vs. Storm

  Spark Streaming discretizes the stream by configurable intervals of no less than 500 milliseconds. Therefore it is not appropriate for true real-time processing. So if you need to capture events in the low hundreds of milliseconds or less, then stick with Storm (at least for now).

  If you can afford one second or more of latency, then Spark provides the advantage of interoperability with the other Spark components and capabilities.
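
A back-of-the-envelope sketch of why the batch interval puts a floor under latency. It assumes a simplified model in which an event waits at most one full interval for its batch to close and then waits for that batch to be processed. The function names and the 200 ms processing time are made up for illustration; none of this is a Spark API.

```python
def worst_case_latency_ms(batch_interval_ms: float, processing_ms: float) -> float:
    """Upper bound on event-to-result latency in a simplified micro-batch model.

    Assumption: an event that arrives right after a batch opens waits one full
    interval for the batch to close, then waits for the batch to be processed.
    """
    return batch_interval_ms + processing_ms


def fits_budget(batch_interval_ms: float, processing_ms: float, budget_ms: float) -> bool:
    """True if the worst-case latency stays within the latency budget."""
    return worst_case_latency_ms(batch_interval_ms, processing_ms) <= budget_ms


if __name__ == "__main__":
    # 500 ms interval plus a hypothetical 200 ms of processing per batch.
    print(worst_case_latency_ms(500, 200))        # 700
    print(fits_budget(500, 200, budget_ms=100))   # False
    print(fits_budget(500, 200, budget_ms=1000))  # True
```

Under this model a 500 ms interval cannot meet a 100 ms budget no matter how fast each batch is processed, which is the point being made about events in the low hundreds of milliseconds.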



  2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:

    Not exactly. There are, of course, major implementation differences and then some subtle and high-level ones too.

    My 2-cents:


    Spark is in-memory M/R, and it simulates streaming or real-time distributed processing of large datasets by micro-batching. The gain in speed and performance over the batch paradigm comes from in-memory buffering or batching (and I am being a bit naive/crude in this explanation).

    Storm, on the other hand, supports stream processing even at the level of a single record (known as a tuple in its lingo). You can do micro-batching on top of it as well (using the Trident API, which is also good for state maintenance, if your business logic requires that). This is more applicable where you want control at the single-record level rather than over a set, collection, or batch of records.

    Having said that, Spark Streaming is trying to simulate Storm's extremely granular approach, but as far as I recall it is still built on top of core Spark (basically another level of abstraction over core Spark constructs).

    So given this, you can pick the framework which is more attuned to your needs.
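
The difference between the two models can be sketched in plain Python, without either framework. The event shape, the function names, and the upper-casing "work" are all hypothetical. This only illustrates tuple-at-a-time handling versus grouping by a fixed interval; it is not how Storm or Spark Streaming are implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (arrival time in ms, payload); a made-up event shape


def per_record(events: Iterable[Event]) -> List[Tuple[int, str]]:
    """Handle each event on its own, as soon as it arrives (tuple-at-a-time)."""
    return [(ts, payload.upper()) for ts, payload in events]


def micro_batch(events: Iterable[Event], interval_ms: int) -> Dict[int, List[str]]:
    """Group events into fixed intervals; a group is handled when its interval closes."""
    batches: Dict[int, List[str]] = defaultdict(list)
    for ts, payload in events:
        # The batch containing `ts` closes at the next multiple of the interval.
        batch_end = (ts // interval_ms + 1) * interval_ms
        batches[batch_end].append(payload.upper())
    return dict(batches)


if __name__ == "__main__":
    events = [(10, "a"), (120, "b"), (480, "c"), (510, "d")]
    print(per_record(events))        # [(10, 'A'), (120, 'B'), (480, 'C'), (510, 'D')]
    print(micro_batch(events, 500))  # {500: ['A', 'B', 'C'], 1000: ['D']}
```

In the per-record version the result for the event at 10 ms is available at roughly 10 ms. In the micro-batch version the same event is not handled until its batch closes at 500 ms.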



    On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

      Do these two projects do essentially the same thing? Is one better than the other?



Re: Big Data tech stack (was Spark vs. Storm)

Posted by Stephen Boesch <ja...@gmail.com>.
You will not arrive at a generic stack without oversimplifying to the
point of serious deficiencies. There are, as you say, a multitude of
options. You are attempting to boil them down to "A vs. B" as opposed to
"A may work better under the following conditions ..."


2014-07-02 13:25 GMT-07:00 Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com>:

>   You know what I’m really trying to do? I’m trying to come up with a
> best practice technology stack. There are so many freaking projects it is
> overwhelming. If I were to walk into an organization that had no Big Data
> capability, what mix of projects would be best to implement based on
> performance, scalability and ease of use/implementation? So far I’ve got:
> Ubuntu
> Hadoop
> Cassandra (Seems to be the highest performing NoSQL database out there.)
> Storm (maybe?)
> Python (Easier than Java. Maybe that shouldn’t be a concern.)
> Hive (For people to leverage their existing SQL skillset.)
>
> That would seem to cover transaction processing and warehouse storage and
> the capability to do batch and real time analysis. What am I leaving out or
> what do I have incorrect in my assumptions?
>
> B.
>
>
>
>  *From:* Stephen Boesch <ja...@gmail.com>
> *Sent:* Wednesday, July 02, 2014 3:07 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs. Storm
>
>  Spark Streaming discretizes the stream by configurable intervals of no
> less than 500 milliseconds. Therefore it is not appropriate for true
> real-time processing. So if you need to capture events in the low hundreds
> of milliseconds or less, then stick with Storm (at least for now).
>
> If you can afford one second or more of latency, then Spark provides the
> advantage of interoperability with the other Spark components and
> capabilities.
>
>
> 2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:
>
>> Not exactly. There are, of course, major implementation differences and
>> then some subtle and high-level ones too.
>>
>> My 2-cents:
>>
>> Spark is in-memory M/R, and it simulates streaming or real-time
>> distributed processing of large datasets by micro-batching. The gain in
>> speed and performance over the batch paradigm comes from in-memory
>> buffering or batching (and I am being a bit naive/crude in this
>> explanation).
>>
>> Storm, on the other hand, supports stream processing even at the level of
>> a single record (known as a tuple in its lingo). You can do micro-batching
>> on top of it as well (using the Trident API, which is also good for state
>> maintenance, if your business logic requires that). This is more
>> applicable where you want control at the single-record level rather than
>> over a set, collection, or batch of records.
>>
>> Having said that, Spark Streaming is trying to simulate Storm's extremely
>> granular approach, but as far as I recall it is still built on top of core
>> Spark (basically another level of abstraction over core Spark constructs).
>>
>> So given this, you can pick the framework which is more attuned to your
>> needs.
>>
>>
>> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   Do these two projects do essentially the same thing? Is one better
>>> than the other?
>>>
>>
>>
>
>

Re: Big Data tech stack (was Spark vs. Storm)

Posted by Bertrand Dechoux <de...@gmail.com>.
I will second Stephen. At best you will arrive at a point where you can
say "I don't care about your problems; here is the solution." Even though
that sounds attractive if you are paid to set up the solution, it's really
not the position a 'client' would want you to hold.

Bertrand Dechoux


On Thu, Jul 3, 2014 at 1:10 AM, Gavin Yue <yu...@gmail.com> wrote:

> Isn't this what YARN or Mesos are trying to do? Separate resource
> management from applications, and run whatever is suitable on top of them.
> Spark can also run on top of YARN or Mesos. Spark was designed for
> iteration-intensive computing such as machine learning algorithms.
>
> Storm is quite different. It is not designed for big data stored on
> hard disk. It was inspired by streaming data such as tweets. Map-Reduce/HDFS,
> on the other hand, was initially designed to process stored web pages to
> build an index.
>
> Hadoop is on its way to becoming a generic Big Data analysis framework.
> Hortonworks and Cloudera are trying to make it much easier to manage
> and deploy.
>
>
>
> On Wed, Jul 2, 2014 at 4:25 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   You know what I’m really trying to do? I’m trying to come up with a
>> best practice technology stack. There are so many freaking projects it is
>> overwhelming. If I were to walk into an organization that had no Big Data
>> capability, what mix of projects would be best to implement based on
>> performance, scalability and ease of use/implementation? So far I’ve got:
>> Ubuntu
>> Hadoop
>> Cassandra (Seems to be the highest performing NoSQL database out there.)
>> Storm (maybe?)
>> Python (Easier than Java. Maybe that shouldn’t be a concern.)
>> Hive (For people to leverage their existing SQL skillset.)
>>
>> That would seem to cover transaction processing and warehouse storage and
>> the capability to do batch and real time analysis. What am I leaving out or
>> what do I have incorrect in my assumptions?
>>
>> B.
>>
>>
>>
>>  *From:* Stephen Boesch <ja...@gmail.com>
>> *Sent:* Wednesday, July 02, 2014 3:07 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Spark vs. Storm
>>
>>  Spark Streaming discretizes the stream by configurable intervals of no
>> less than 500 milliseconds. Therefore it is not appropriate for true
>> real-time processing. So if you need to capture events in the low hundreds
>> of milliseconds or less, then stick with Storm (at least for now).
>>
>> If you can afford one second or more of latency, then Spark provides the
>> advantage of interoperability with the other Spark components and
>> capabilities.
>>
>>
>> 2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:
>>
>>> Not exactly. There are, of course, major implementation differences and
>>> then some subtle and high-level ones too.
>>>
>>> My 2-cents:
>>>
>>> Spark is in-memory M/R, and it simulates streaming or real-time
>>> distributed processing of large datasets by micro-batching. The gain in
>>> speed and performance over the batch paradigm comes from in-memory
>>> buffering or batching (and I am being a bit naive/crude in this
>>> explanation).
>>>
>>> Storm, on the other hand, supports stream processing even at the level of
>>> a single record (known as a tuple in its lingo). You can do micro-batching
>>> on top of it as well (using the Trident API, which is also good for state
>>> maintenance, if your business logic requires that). This is more
>>> applicable where you want control at the single-record level rather than
>>> over a set, collection, or batch of records.
>>>
>>> Having said that, Spark Streaming is trying to simulate Storm's extremely
>>> granular approach, but as far as I recall it is still built on top of core
>>> Spark (basically another level of abstraction over core Spark constructs).
>>>
>>> So given this, you can pick the framework which is more attuned to your
>>> needs.
>>>
>>>
>>> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   Do these two projects do essentially the same thing? Is one better
>>>> than the other?
>>>>
>>>
>>>
>>
>>
>
>

Re: Big Data tech stack (was Spark vs. Storm)

Posted by Bertrand Dechoux <de...@gmail.com>.
I will second Stephen. At best you will arrive at a point where you can
tell "I don't care about your problems here is the solution". Even though
it sounds attractive if you are paid to set up the solution, that's really
not the position a 'client' would want you to hold.

Bertrand Dechoux


On Thu, Jul 3, 2014 at 1:10 AM, Gavin Yue <yu...@gmail.com> wrote:

> Isn't this what YARN or Mesos are trying to do? They separate resource
> management from applications, so you can run whatever is suitable on top
> of them. Spark can also run on YARN or Mesos. Spark was designed for
> iteration-intensive computing such as machine learning algorithms.
>
> Storm is quite different. It is not designed for big data stored on hard
> disk; it was inspired by streaming data such as tweets. MapReduce/HDFS, on
> the other hand, was initially designed to process stored web pages to
> build up an index.
>
> Hadoop is on its way to becoming a generic Big Data analysis framework.
> Hortonworks and Cloudera are trying to make it much easier to manage and
> deploy.
>
>
>
> On Wed, Jul 2, 2014 at 4:25 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   You know what I’m really trying to do? I’m trying to come up with a
>> best practice technology stack. There are so many freaking projects it is
>> overwhelming. If I were to walk into an organization that had no Big Data
>> capability, what mix of projects would be best to implement based on
>> performance, scalability and ease of use/implementation? So far I’ve got:
>> Ubuntu
>> Hadoop
>> Cassandra (Seems to be the highest performing NoSQL database out there.)
>> Storm (maybe?)
>> Python (Easier than Java. Maybe that shouldn’t be a concern.)
>> Hive (For people to leverage their existing SQL skillset.)
>>
>> That would seem to cover transaction processing and warehouse storage and
>> the capability to do batch and real time analysis. What am I leaving out or
>> what do I have incorrect in my assumptions?
>>
>> B.
>>
>>
>>
>>  *From:* Stephen Boesch <ja...@gmail.com>
>> *Sent:* Wednesday, July 02, 2014 3:07 PM
>> *To:* user@hadoop.apache.org
>> *Subject:* Re: Spark vs. Storm
>>
>>  Spark Streaming discretizes the stream by configurable intervals of no
>> less than 500 milliseconds. Therefore it is not appropriate for true
>> real-time processing. So if you need to capture events in the low 100s of
>> milliseconds range or less, then stick with Storm (at least for now).
>>
>> If you can afford one second+ of latency then spark provides advantages
>> of interoperability with the other Spark components and capabilities.
>>
>>
>> 2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:
>>
>>> Not exactly. There are of course  major implementation differences and
>>> then some subtle and high level ones too.
>>>
>>> My 2-cents:
>>>
>>> Spark is in-memory M/R and it simulated streaming or real-time
>>> distributed process for large datasets by micro-batching. The gain in speed
>>> and performance as opposed to batch paradigm is in-memory buffering or
>>> batching (and I am here being a bit naive/crude in explanation.)
>>>
>>> Storm on the other hand, supports stream processing even at a single
>>> record level (known as tuple in its lingo.) You can do micro-batching on
>>> top of it as well (using Trident API which is good for state maintenance
>>> too, if your BL requires that). This is more applicable where you want
>>> control to a single record level rather than set, collection or batch of
>>> records.
>>>
>>> Having said that, Spark Streaming is trying to simulate Storm's extreme
>>> granular approach but as far as I recall, it still is built on top of core
>>> Spark (basically another level of abstraction over core Spark constructs.)
>>>
>>> So given this, you can pick the framework which is more attuned to your
>>> needs.
>>>
>>>
>>> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
>>> adaryl.wakefield@hotmail.com> wrote:
>>>
>>>>   Do these two projects do essentially the same thing? Is one better
>>>> than the other?
>>>>
>>>
>>>
>>
>>
>
>

Re: Big Data tech stack (was Spark vs. Storm)

Posted by Gavin Yue <yu...@gmail.com>.
Isn't this what YARN or Mesos are trying to do? They separate resource
management from applications, so you can run whatever is suitable on top
of them. Spark can also run on YARN or Mesos. Spark was designed for
iteration-intensive computing such as machine learning algorithms.

Storm is quite different. It is not designed for big data stored on hard
disk; it was inspired by streaming data such as tweets. MapReduce/HDFS, on
the other hand, was initially designed to process stored web pages to
build up an index.

Hadoop is on its way to becoming a generic Big Data analysis framework.
Hortonworks and Cloudera are trying to make it much easier to manage and
deploy.



On Wed, Jul 2, 2014 at 4:25 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   You know what I’m really trying to do? I’m trying to come up with a
> best practice technology stack. There are so many freaking projects it is
> overwhelming. If I were to walk into an organization that had no Big Data
> capability, what mix of projects would be best to implement based on
> performance, scalability and ease of use/implementation? So far I’ve got:
> Ubuntu
> Hadoop
> Cassandra (Seems to be the highest performing NoSQL database out there.)
> Storm (maybe?)
> Python (Easier than Java. Maybe that shouldn’t be a concern.)
> Hive (For people to leverage their existing SQL skillset.)
>
> That would seem to cover transaction processing and warehouse storage and
> the capability to do batch and real time analysis. What am I leaving out or
> what do I have incorrect in my assumptions?
>
> B.
>
>
>
>  *From:* Stephen Boesch <ja...@gmail.com>
> *Sent:* Wednesday, July 02, 2014 3:07 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs. Storm
>
>  Spark Streaming discretizes the stream by configurable intervals of no
> less than 500 milliseconds. Therefore it is not appropriate for true
> real-time processing. So if you need to capture events in the low 100s of
> milliseconds range or less, then stick with Storm (at least for now).
>
> If you can afford one second+ of latency then spark provides advantages of
> interoperability with the other Spark components and capabilities.
>
>
> 2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:
>
>> Not exactly. There are of course  major implementation differences and
>> then some subtle and high level ones too.
>>
>> My 2-cents:
>>
>> Spark is in-memory M/R and it simulated streaming or real-time
>> distributed process for large datasets by micro-batching. The gain in speed
>> and performance as opposed to batch paradigm is in-memory buffering or
>> batching (and I am here being a bit naive/crude in explanation.)
>>
>> Storm on the other hand, supports stream processing even at a single
>> record level (known as tuple in its lingo.) You can do micro-batching on
>> top of it as well (using Trident API which is good for state maintenance
>> too, if your BL requires that). This is more applicable where you want
>> control to a single record level rather than set, collection or batch of
>> records.
>>
>> Having said that, Spark Streaming is trying to simulate Storm's extreme
>> granular approach but as far as I recall, it still is built on top of core
>> Spark (basically another level of abstraction over core Spark constructs.)
>>
>> So given this, you can pick the framework which is more attuned to your
>> needs.
>>
>>
>> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   Do these two projects do essentially the same thing? Is one better
>>> than the other?
>>>
>>
>>
>
>

Re: Big Data tech stack (was Spark vs. Storm)

Posted by Stephen Boesch <ja...@gmail.com>.
You will not arrive at a generic stack without oversimplifying to the
point of serious deficiencies. There are, as you say, a multitude of
options. You are attempting to boil them down to "A vs. B" as opposed to
"A may work better under the following conditions ...".


2014-07-02 13:25 GMT-07:00 Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com>:

>   You know what I’m really trying to do? I’m trying to come up with a
> best practice technology stack. There are so many freaking projects it is
> overwhelming. If I were to walk into an organization that had no Big Data
> capability, what mix of projects would be best to implement based on
> performance, scalability and ease of use/implementation? So far I’ve got:
> Ubuntu
> Hadoop
> Cassandra (Seems to be the highest performing NoSQL database out there.)
> Storm (maybe?)
> Python (Easier than Java. Maybe that shouldn’t be a concern.)
> Hive (For people to leverage their existing SQL skillset.)
>
> That would seem to cover transaction processing and warehouse storage and
> the capability to do batch and real time analysis. What am I leaving out or
> what do I have incorrect in my assumptions?
>
> B.
>
>
>
>  *From:* Stephen Boesch <ja...@gmail.com>
> *Sent:* Wednesday, July 02, 2014 3:07 PM
> *To:* user@hadoop.apache.org
> *Subject:* Re: Spark vs. Storm
>
>  Spark Streaming discretizes the stream by configurable intervals of no
> less than 500 milliseconds. Therefore it is not appropriate for true
> real-time processing. So if you need to capture events in the low 100s of
> milliseconds range or less, then stick with Storm (at least for now).
>
> If you can afford one second+ of latency then spark provides advantages of
> interoperability with the other Spark components and capabilities.
>
>
> 2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:
>
>> Not exactly. There are of course  major implementation differences and
>> then some subtle and high level ones too.
>>
>> My 2-cents:
>>
>> Spark is in-memory M/R and it simulated streaming or real-time
>> distributed process for large datasets by micro-batching. The gain in speed
>> and performance as opposed to batch paradigm is in-memory buffering or
>> batching (and I am here being a bit naive/crude in explanation.)
>>
>> Storm on the other hand, supports stream processing even at a single
>> record level (known as tuple in its lingo.) You can do micro-batching on
>> top of it as well (using Trident API which is good for state maintenance
>> too, if your BL requires that). This is more applicable where you want
>> control to a single record level rather than set, collection or batch of
>> records.
>>
>> Having said that, Spark Streaming is trying to simulate Storm's extreme
>> granular approach but as far as I recall, it still is built on top of core
>> Spark (basically another level of abstraction over core Spark constructs.)
>>
>> So given this, you can pick the framework which is more attuned to your
>> needs.
>>
>>
>> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
>> adaryl.wakefield@hotmail.com> wrote:
>>
>>>   Do these two projects do essentially the same thing? Is one better
>>> than the other?
>>>
>>
>>
>
>

Big Data tech stack (was Spark vs. Storm)

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
You know what I’m really trying to do? I’m trying to come up with a best practice technology stack. There are so many freaking projects it is overwhelming. If I were to walk into an organization that had no Big Data capability, what mix of projects would be best to implement based on performance, scalability and ease of use/implementation? So far I’ve got:
Ubuntu
Hadoop
Cassandra (Seems to be the highest performing NoSQL database out there.)
Storm (maybe?)
Python (Easier than Java. Maybe that shouldn’t be a concern.)
Hive (For people to leverage their existing SQL skillset.)

That would seem to cover transaction processing and warehouse storage and the capability to do batch and real time analysis. What am I leaving out or what do I have incorrect in my assumptions?

B.



From: Stephen Boesch 
Sent: Wednesday, July 02, 2014 3:07 PM
To: user@hadoop.apache.org 
Subject: Re: Spark vs. Storm

Spark Streaming discretizes the stream by configurable intervals of no less than 500 milliseconds. Therefore it is not appropriate for true real-time processing. So if you need to capture events in the low 100s of milliseconds range or less, then stick with Storm (at least for now).

If you can afford one second+ of latency, then Spark provides advantages of interoperability with the other Spark components and capabilities.
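
To make the latency point concrete, here is a rough sketch in plain Python of what "discretizing the stream" implies. It does not use the Spark Streaming API; the function names, the event format and the sample timestamps are all made up for illustration, and the 500 ms interval is simply the figure quoted in the message above. The idea is that an event cannot be processed until its interval closes, so batching alone can add up to one full interval of delay.

```python
from collections import defaultdict


def discretize(events, batch_interval_ms):
    """Group (timestamp_ms, payload) events into fixed-width intervals.

    Returns a list of (batch_end_ms, [(timestamp_ms, payload), ...])
    sorted by batch end time. Hypothetical helper, not a Spark call.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        # Integer division assigns each event to the interval it falls in.
        batches[ts // batch_interval_ms].append((ts, payload))
    return [((index + 1) * batch_interval_ms, batches[index])
            for index in sorted(batches)]


def max_batching_delay(events, batch_interval_ms):
    """Longest wait between an event's arrival and the close of its batch."""
    return max(batch_end - ts
               for batch_end, batch in discretize(events, batch_interval_ms)
               for ts, _ in batch)


# Made-up sample events: (arrival time in ms, payload).
events = [(0, "a"), (120, "b"), (499, "c"), (500, "d"), (1230, "e")]

for batch_end, batch in discretize(events, 500):
    print(batch_end, [payload for _, payload in batch])
print("worst-case batching delay (ms):", max_batching_delay(events, 500))
```

With these sample events, the event arriving at 0 ms waits the full 500 ms before its batch closes, and that is before any processing time is counted. A smaller interval shrinks this delay but means more, smaller batches to schedule.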



2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:

  Not exactly. There are of course  major implementation differences and then some subtle and high level ones too. 

  My 2-cents:


  Spark is in-memory M/R, and it simulates streaming or real-time distributed processing for large datasets by micro-batching. The gain in speed and performance over the batch paradigm comes from in-memory buffering or batching (and I am being a bit naive/crude in this explanation).

  Storm, on the other hand, supports stream processing even at the level of a single record (known as a tuple in its lingo). You can do micro-batching on top of it as well (using the Trident API, which is also good for state maintenance if your BL requires that). This is more applicable where you want control down to the single-record level rather than over a set, collection or batch of records.

  Having said that, Spark Streaming is trying to simulate Storm's extreme granular approach but as far as I recall, it still is built on top of core Spark (basically another level of abstraction over core Spark constructs.)

  So given this, you can pick the framework which is more attuned to your needs.
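
The difference in granularity described above can be sketched in a few lines of plain Python. This uses neither the Storm nor the Spark API; the function names and the sample readings are hypothetical, and batching by record count rather than by time interval is a simplification to keep the example short.

```python
def process_per_record(stream, handle):
    """Per-record style: call the handler once for every record."""
    return [handle(record) for record in stream]


def process_micro_batches(stream, handle_batch, batch_size):
    """Micro-batch style: buffer records and hand them over in groups."""
    results = []
    buffer = []
    for record in stream:
        buffer.append(record)
        if len(buffer) == batch_size:
            results.append(handle_batch(buffer))
            buffer = []
    if buffer:  # flush the final, possibly short, batch
        results.append(handle_batch(buffer))
    return results


# Made-up sample data.
readings = [3, 8, 1, 9, 4]

per_record = process_per_record(readings, lambda r: r * 2)
per_batch = process_micro_batches(readings, sum, batch_size=2)

print(per_record)  # one result per input record
print(per_batch)   # one result per group of up to two records
```

In the per-record version the handler sees each record on its own, so it can react to any single record immediately. In the micro-batch version the handler only ever sees groups, which suits aggregate work such as sums or counts but gives up control over individual records.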



  On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    Do these two projects do essentially the same thing? Is one better than the other?


Big Data tech stack (was Spark vs. Storm)

Posted by "Adaryl \"Bob\" Wakefield, MBA" <ad...@hotmail.com>.
You know what I’m really trying to do? I’m trying to come up with a best practice technology stack. There are so many freaking projects it is overwhelming. If I were to walk into an organization that had no Big Data capability, what mix of projects would be best to implement based on performance, scalability and easy of use/implementation? So far I’ve got:
Ubuntu
Hadoop
Cassandra (Seems to be the highest performing NoSQL database out there.)
Storm (maybe?)
Python (Easier than Java. Maybe that shouldn’t be a concern.)
Hive (For people to leverage their existing SQL skillset.)

That would seem to cover transaction processing and warehouse storage and the capability to do batch and real time analysis. What am I leaving out or what do I have incorrect in my assumptions?

B.



From: Stephen Boesch 
Sent: Wednesday, July 02, 2014 3:07 PM
To: user@hadoop.apache.org 
Subject: Re: Spark vs. Storm

Spark Streaming discretizes the stream by configurable intervals of no less than 500 milliseconds. Therefore it is not appropriate for true real-time processing. So if you need to capture events in the low hundreds of milliseconds range or less, then stick with Storm (at least for now).

If you can afford one second or more of latency, then Spark provides advantages of interoperability with the other Spark components and capabilities.



2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:

  Not exactly. There are of course major implementation differences, and then some subtle and high-level ones too.

  My 2 cents:


  Spark is in-memory M/R and it simulates streaming or real-time distributed processing for large datasets by micro-batching. The gain in speed and performance as opposed to the batch paradigm comes from in-memory buffering or batching (and I am being a bit naive/crude in my explanation here).

  Storm, on the other hand, supports stream processing even at a single record level (known as a tuple in its lingo). You can do micro-batching on top of it as well (using the Trident API, which is good for state maintenance too, if your business logic requires that). This is more applicable where you want control down to a single record level rather than over a set, collection or batch of records.

  Having said that, Spark Streaming is trying to simulate Storm's extremely granular approach but, as far as I recall, it is still built on top of core Spark (basically another level of abstraction over core Spark constructs).

  So given this, you can pick the framework which is more attuned to your needs.



  On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <ad...@hotmail.com> wrote:

    Do these two projects do essentially the same thing? Is one better than the other?


Re: Spark vs. Storm

Posted by Stephen Boesch <ja...@gmail.com>.
Spark Streaming discretizes the stream by configurable intervals of no less
than 500 milliseconds. Therefore it is not appropriate for true real-time
processing. So if you need to capture events in the low hundreds of
milliseconds range or less, then stick with Storm (at least for now).

If you can afford one second or more of latency, then Spark provides
advantages of interoperability with the other Spark components and
capabilities.
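
To make the latency argument concrete, here is a rough back-of-the-envelope
sketch in plain Python. It is not Spark or Storm code; the function name and
the fixed processing time are assumptions made up for illustration. The model
is deliberately simple: an event waits until its batch interval closes, and
then the whole batch takes a fixed time to process.

```python
def micro_batch_latency_bounds(batch_interval_ms, processing_ms):
    """Return (best, worst) end-to-end latency in ms for a single event.

    Simplified model (an assumption, not measured behaviour):
      - best case: the event arrives just as the batch closes, so it
        only pays the processing time;
      - worst case: the event arrives just after a batch opened, so it
        waits the full interval and then pays the processing time.
    """
    best = processing_ms
    worst = batch_interval_ms + processing_ms
    return best, worst


if __name__ == "__main__":
    # 500 ms is the minimum interval mentioned above; 100 ms of
    # processing time is a made-up figure.
    best, worst = micro_batch_latency_bounds(500, 100)
    print("best case: %d ms, worst case: %d ms" % (best, worst))
```

Under these assumptions a 500 ms interval with 100 ms of processing gives
anywhere from 100 ms to 600 ms per event, which is why a requirement in the
low hundreds of milliseconds is hard to meet with micro-batching.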


2014-07-02 12:59 GMT-07:00 Shahab Yunus <sh...@gmail.com>:

> Not exactly. There are of course major implementation differences, and
> then some subtle and high-level ones too.
>
> My 2 cents:
>
> Spark is in-memory M/R and it simulates streaming or real-time distributed
> processing for large datasets by micro-batching. The gain in speed and
> performance as opposed to the batch paradigm comes from in-memory buffering
> or batching (and I am being a bit naive/crude in my explanation here).
>
> Storm, on the other hand, supports stream processing even at a single
> record level (known as a tuple in its lingo). You can do micro-batching on
> top of it as well (using the Trident API, which is good for state
> maintenance too, if your business logic requires that). This is more
> applicable where you want control down to a single record level rather
> than over a set, collection or batch of records.
>
> Having said that, Spark Streaming is trying to simulate Storm's extremely
> granular approach but, as far as I recall, it is still built on top of core
> Spark (basically another level of abstraction over core Spark constructs).
>
> So given this, you can pick the framework which is more attuned to your
> needs.
>
>
> On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
> adaryl.wakefield@hotmail.com> wrote:
>
>>   Do these two projects do essentially the same thing? Is one better
>> than the other?
>>
>
>

Re: Spark vs. Storm

Posted by Shahab Yunus <sh...@gmail.com>.
Not exactly. There are of course major implementation differences, and then
some subtle and high-level ones too.

My 2 cents:

Spark is in-memory M/R and it simulates streaming or real-time distributed
processing for large datasets by micro-batching. The gain in speed and
performance as opposed to the batch paradigm comes from in-memory buffering
or batching (and I am being a bit naive/crude in my explanation here).

Storm, on the other hand, supports stream processing even at a single record
level (known as a tuple in its lingo). You can do micro-batching on top of it
as well (using the Trident API, which is good for state maintenance too, if
your business logic requires that). This is more applicable where you want
control down to a single record level rather than over a set, collection or
batch of records.

Having said that, Spark Streaming is trying to simulate Storm's extremely
granular approach but, as far as I recall, it is still built on top of core
Spark (basically another level of abstraction over core Spark constructs).

So given this, you can pick the framework which is more attuned to your
needs.
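
The difference between the two models can be sketched in a few lines of plain
Python. This is only an illustration of the idea and uses neither the Storm
nor the Spark API; the function names are made up. As a further simplifying
assumption, the batches here are cut by record count so the example is
deterministic, whereas Spark Streaming cuts them by time interval.

```python
from typing import Callable, Iterable, Iterator, List


def process_per_record(records: Iterable[int],
                       handle: Callable[[int], int]) -> Iterator[int]:
    """Per-record style: the handler runs once for every single record."""
    for record in records:
        yield handle(record)


def process_micro_batched(records: Iterable[int],
                          batch_size: int,
                          handle_batch: Callable[[List[int]], int]) -> Iterator[int]:
    """Micro-batch style: records are buffered in memory and the handler
    runs once per batch (batches cut by count here, an assumption)."""
    batch: List[int] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield handle_batch(batch)
            batch = []
    if batch:
        # Flush the final, possibly smaller, batch.
        yield handle_batch(batch)


if __name__ == "__main__":
    events = [1, 2, 3, 4, 5]
    # One result per record.
    print(list(process_per_record(events, lambda r: r * 2)))
    # One result per batch of up to two records.
    print(list(process_micro_batched(events, 2, sum)))
```

With per-record processing you get a result (and a chance to act) for every
event; with micro-batching you only see results once a batch is complete,
which trades latency for the ability to work on a whole set at once.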


On Wed, Jul 2, 2014 at 3:31 PM, Adaryl "Bob" Wakefield, MBA <
adaryl.wakefield@hotmail.com> wrote:

>   Do these two projects do essentially the same thing? Is one better than
> the other?
>
