Posted to droids-dev@incubator.apache.org by Otis Gospodnetic <ot...@yahoo.com> on 2009/11/12 21:37:22 UTC

Queue: in memory or on disk?

Hello,

I haven't looked at the sources.  But who stores items put in the Queue?  Are they in memory, or does something write them to disk, or something else?

Thanks,
Otis
--
Sematext is hiring -- http://sematext.com/about/jobs.html?mls
Lucene, Solr, Nutch, Katta, Hadoop, HBase, UIMA, NLP, NER, IR


Re: Queue: in memory or on disk?

Posted by Chapuis Bertil <bc...@agimem.com>.
Personally, I used Droids to crawl a website of approximately 250,000 pages. The queue was stored in memory, and I arbitrarily allocated 1 GB of memory to Java. Everything worked fine.

That's not a large number of web pages, but I think Droids' current implementation is well suited to such jobs: crawling a relatively small set of web pages or crawling an intranet. This is particularly true if you need to customize how the pages are handled.

I hope this experience helps.

Bertil Chapuis




Re: Queue: in memory or on disk?

Posted by Otis Gospodnetic <og...@yahoo.com>.
OK, thanks.

So how do people really use Droids at scale, e.g. for crawling a large number of web pages?  I happen to use it for something smallish, so I have never had OOMs caused by the queue living in the JVM heap.  But I imagine that anyone using it for a larger crawl would hit an OOM sooner or later, no?

Does this imply that either nobody is using Droids for large-scale crawls, or that everyone who does has implemented their own custom disk-backed queue?


Thanks,
Otis





Re: Queue: in memory or on disk?

Posted by Ryan McKinley <ry...@gmail.com>.
Yes, the standard one is in memory.

It is easy to write one that stores things on disk or wherever. I use
one that stores tasks in an H2 database, but it is not general enough
to contribute back...

I think Migfa was looking at replacing the Droids Queue interface with
a standard java.util.Queue interface.
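
To illustrate what a move to java.util.Queue could allow, here is a rough sketch, not Droids code. It assumes tasks are plain URL strings, and the class name BoundedFrontier and the method enqueueAll are made up for the example. Code written against java.util.Queue can accept any JDK implementation, including a bounded LinkedBlockingQueue whose offer() returns false when full, so the heap use of the queue is capped rather than growing until an OOM.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical example class; nothing here is part of the Droids API.
class BoundedFrontier {

    // Offers every URL to the queue and returns how many were accepted.
    // Works with any java.util.Queue implementation.
    static int enqueueAll(Queue<String> queue, Iterable<String> urls) {
        int accepted = 0;
        for (String url : urls) {
            // offer() returns false instead of throwing when a bounded queue is full
            if (queue.offer(url)) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Capacity of 2 is an arbitrary value chosen for the demonstration.
        Queue<String> queue = new LinkedBlockingQueue<>(2);
        List<String> urls = Arrays.asList(
                "http://example.org/a",
                "http://example.org/b",
                "http://example.org/c");

        int accepted = enqueueAll(queue, urls);
        System.out.println(accepted + " of " + urls.size() + " URLs accepted");
        System.out.println("next task: " + queue.poll());
    }
}
```

A real crawler would still have to decide what to do with rejected URLs (retry later, spill to disk, or drop them); the sketch only shows that the bound is enforced by the queue itself.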

ryan




Re: Queue: in memory or on disk?

Posted by Chapuis Bertil <bc...@agimem.com>.
I think the current implementation only provides in-memory queues of tasks. However, since the TaskQueue interface is relatively simple, it shouldn't be too hard to persist the data on disk or to implement a TaskQueue that works with a JMS broker or something else.
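
As a rough idea of what persisting the queue on disk might look like, here is a minimal sketch. It does not use the real Droids TaskQueue interface: SimpleTaskQueue and FileTaskQueue are hypothetical names invented for this example, and tasks are assumed to be single-line strings such as URLs. Tasks are appended to a file, one per line, and handed out in FIFO order by tracking a read offset, so only two offsets and a counter live on the heap.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical file-backed FIFO queue; not part of Droids.
class FileTaskQueue implements SimpleTaskQueue, AutoCloseable {
    private final RandomAccessFile file;
    private long readPos = 0;   // byte offset of the next task to hand out
    private long writePos = 0;  // byte offset where the next task is appended
    private long size = 0;      // number of tasks not yet polled

    FileTaskQueue(Path path) throws IOException {
        file = new RandomAccessFile(path.toFile(), "rw");
        file.setLength(0); // start empty; a real queue would recover state here
    }

    @Override
    public synchronized void offer(String task) throws IOException {
        if (task.indexOf('\n') >= 0 || task.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("task must be a single line");
        }
        file.seek(writePos);
        file.write((task + "\n").getBytes(StandardCharsets.UTF_8));
        writePos = file.getFilePointer();
        size++;
    }

    @Override
    public synchronized String poll() throws IOException {
        if (size == 0) {
            return null; // queue is empty
        }
        file.seek(readPos);
        // readLine() maps each byte to one char, so re-decode the bytes as UTF-8
        String raw = file.readLine();
        readPos = file.getFilePointer();
        size--;
        return new String(raw.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    @Override
    public synchronized long size() {
        return size;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("tasks", ".queue");
        try (FileTaskQueue queue = new FileTaskQueue(path)) {
            queue.offer("http://example.org/a");
            queue.offer("http://example.org/b");
            System.out.println("polled: " + queue.poll());
            System.out.println("remaining: " + queue.size());
        } finally {
            Files.deleteIfExists(path);
        }
    }
}

// Hypothetical minimal contract, assumed for illustration only.
interface SimpleTaskQueue {
    void offer(String task) throws IOException;
    String poll() throws IOException; // returns null when the queue is empty
    long size();
}
```

The sketch never reclaims the space used by tasks that were already polled and does not survive a restart; a production version would need compaction and recovery, or could delegate both to a database or a message broker.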

