Posted to user@storm.apache.org by Albert Vila <al...@augure.com> on 2014/10/07 12:26:56 UTC

Is storm a good choice for our architecture?

Hi

I just came across Storm when I was trying to find solutions to scale our
current architecture.

We are currently downloading and processing 6M documents per day from
online and social media. We have a different workflow for each type of
document, but some of the steps are keyword extraction, language detection,
clustering, classification, indexing, and so on. We are using Gearman to
dispatch the jobs to workers.
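
For a sense of scale, 6M documents per day works out to roughly 70
documents per second on average. The sketch below just does that
arithmetic. The average document size and the number of network hops are
made-up placeholders, not figures from this thread, so the bandwidth number
is only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def docs_per_second(docs_per_day: int) -> float:
    """Average arrival rate, assuming documents are spread evenly over the day."""
    return docs_per_day / SECONDS_PER_DAY


def mbit_per_second(docs_per_day: int, avg_doc_bytes: int, hops: int) -> float:
    """Average network load if every document crosses the network `hops` times.

    avg_doc_bytes and hops are assumptions supplied by the caller.
    """
    bytes_per_second = docs_per_second(docs_per_day) * avg_doc_bytes * hops
    return bytes_per_second * 8 / 1_000_000


if __name__ == "__main__":
    rate = docs_per_second(6_000_000)
    # Hypothetical: 50 kB per document, shipped between machines 4 times.
    load = mbit_per_second(6_000_000, avg_doc_bytes=50_000, hops=4)
    print(f"{rate:.1f} docs/s, {load:.1f} Mbit/s (with assumed size and hops)")
```

Real traffic is bursty, so peak rates would be higher than this average.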

I'm wondering whether it is feasible to integrate Storm into the current
workflow. One of our main discussions is whether we should move to a fully
distributed architecture or to a semi-distributed one: that is, distribute
everything, or process some steps on the same machine (crawling, keyword
extraction, language detection, indexing). We don't know which one scales
better; each has pros and cons.

We currently have a semi-distributed architecture, because we ran into
network problems given the amount of data we were moving around. So now
every document crawled on server X is later dispatched through Gearman to
that same server, with all the data kept in a local Memcached.

What do you think?
Is it feasible to migrate to a Storm cluster?
Should we take the traffic within the Storm cluster into account?
Is there a way to isolate some bolts so that they are processed on the same
machine, grouped by some field?

Any help or comments would be appreciated. And if someone has had a similar
problem and has knowledge about the architectural approach, that would be
more than welcome.

Thanks

Albert

Re: Is storm a good choice for our architecture?

Posted by Albert Vila <al...@augure.com>.

Thanks, I'll take a look at it.

Regards,

Albert

-- 
*Albert Vila*
R&D Manager & Software Developer

Tél. : +34 972 982 968

*www.augure.com* <http://www.augure.com/> | *Blog.* Reputation in action
<http://blog.augure.es/> | *Twitter. *@AugureSpain
<https://twitter.com/AugureSpain>
*Skype *: albert.vila | *Access map.* Augure Girona
<https://maps.google.com/maps?q=Eiximenis+12,+17001+Girona,+Espanya&hl=ca&sll=50.956548,6.799948&sspn=30.199963,86.044922&hnear=Carrer+Eiximenis,+12,+17001+Girona,+Espanya&t=m&z=16>

Re: Is storm a good choice for our architecture?

Posted by Richards Peter <hb...@gmail.com>.

You can use a custom scheduler to run specific tasks on a particular
machine. Please check the following mail thread:

https://groups.google.com/forum/m/#!topic/storm-user/n_-HZWSdcmM
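
To give a rough idea of what such a scheduler decides: in Storm a custom
scheduler is a Java class implementing the IScheduler interface, enabled
through the storm.scheduler setting on Nimbus. The sketch below is not Storm
code. It is a plain Python simulation of the placement rule only, and the
function name, component names, and host names are all invented for the
example.

```python
from itertools import cycle


def assign_executors(executors, hosts, pinned):
    """Decide which host runs each executor.

    executors: list of (component_name, executor_index) pairs
    hosts:     list of available host names
    pinned:    dict mapping a component name to the host it must run on

    Executors of pinned components all go to their designated host; every
    other executor is spread over the hosts round-robin.
    """
    for component, host in pinned.items():
        if host not in hosts:
            raise ValueError(f"{component} is pinned to unknown host {host}")

    round_robin = cycle(hosts)
    assignment = {}
    for component, index in executors:
        if component in pinned:
            assignment[(component, index)] = pinned[component]
        else:
            assignment[(component, index)] = next(round_robin)
    return assignment


if __name__ == "__main__":
    plan = assign_executors(
        executors=[("crawler", 0), ("keywords", 0), ("keywords", 1), ("indexer", 0)],
        hosts=["server-x", "server-y"],
        pinned={"keywords": "server-x"},  # hypothetical pinning rule
    )
    for executor, host in plan.items():
        print(executor, "->", host)
```

A real scheduler also has to deal with free slots and failed supervisors,
which this sketch leaves out.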

Regards,
Richards Peter.

Re: Is storm a good choice for our architecture?

Posted by Onur Ünlü <un...@gmail.com>.

Hi Albert,

You can use "local or shuffle grouping".

https://storm.incubator.apache.org/documentation/Concepts.html (stream
groupings)
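
As a rough illustration of what this grouping does: if the target bolt has
one or more tasks in the same worker process as the emitting task, tuples
are shuffled among just those in-process tasks; otherwise it behaves like a
normal shuffle grouping. Note that "local" means the same worker process,
not just the same machine. In the Java API it is selected with
localOrShuffleGrouping when wiring a bolt. The sketch below is a plain
Python simulation of the routing decision, not Storm code; the function
name, task ids, and worker names are invented for the example.

```python
import random


def choose_target(source_worker, target_tasks, rng=random):
    """Pick the task that receives the next tuple.

    source_worker: name of the worker process emitting the tuple
    target_tasks:  dict mapping task id -> worker process hosting that task
    """
    # Prefer tasks that live in the same worker process as the emitter.
    local = [task for task, worker in target_tasks.items() if worker == source_worker]
    # With no local task available, fall back to shuffling over all tasks.
    candidates = local if local else list(target_tasks)
    return rng.choice(candidates)


if __name__ == "__main__":
    tasks = {1: "worker-a", 2: "worker-a", 3: "worker-b"}
    print("from worker-a:", choose_target("worker-a", tasks))  # always 1 or 2
    print("from worker-c:", choose_target("worker-c", tasks))  # any of 1, 2, 3
```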

Onur

-- 
Onur Ünlü

Re: Is storm a good choice for our architecture?

Posted by Albert Vila <al...@augure.com>.

We are using Gearman at the moment; I don't know whether anyone has
succeeded in using Gearman to feed the spouts.

One of my questions is whether it's possible to isolate the execution of
some bolts on a specific machine, so that all keyword extraction for
document X is done by the same machine that crawled it. Or maybe I should
not be concerned about network traffic on a Storm cluster.
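
One existing mechanism related to this question is Storm's fields grouping,
which guarantees that tuples with equal values for the chosen fields always
go to the same task of a bolt. It does not by itself decide which machine
that task runs on. The sketch below is a plain Python illustration of the
idea, assuming a simple hash-modulo rule; the function name and the field
values are invented, and Storm's actual hashing may differ.

```python
import zlib


def task_for(field_value: str, num_tasks: int) -> int:
    """Map a field value to a task index in a repeatable way.

    crc32 is used instead of Python's built-in hash() because hash() of a
    string changes between interpreter runs.
    """
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


if __name__ == "__main__":
    # Hypothetical tuples keyed by the server that crawled the document.
    for crawled_on in ["server-x", "server-y", "server-x"]:
        print(crawled_on, "-> task", task_for(crawled_on, num_tasks=4))
```

Every tuple keyed by the same crawling server lands on the same task, so
per-server state can stay in one place.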

Regards

Albert

Re: Is storm a good choice for our architecture?

Posted by padma priya chitturi <pa...@gmail.com>.

Storm would be feasible for your business problem. You could design the
topology in such a way that a few bolts do the keyword extraction, another
set of bolts does language detection, and so on. You can apply your
clustering and classification algorithms from Mahout to the streams of data
processed by the bolts.

The only thing I am concerned about is the data source. If your data comes
from a source like Kafka, that would be great. I don't think spouts reading
data from files would be the best fit.

Regards,
Padma Ch
