Posted to user@ignite.apache.org by vm <go...@gmail.com> on 2016/08/01 12:16:05 UTC

Implementing a distributed crawler

Hi,

I'm in the process of evaluating Ignite for use in an experimental
distributed web crawler.

I would like to avoid a master/worker architecture, and instead have each
node pulling URLs to crawl from a distributed Queue - which is populated by
the crawler instances themselves. Ignite's queue seems to be working fine
for this.
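For reference, IgniteQueue implements java.util.concurrent.BlockingQueue, so the per-node crawl loop can be sketched against that interface. A bounded LinkedBlockingQueue stands in below so the sketch runs without a cluster; the queue name, capacity, and fetch(...) helper are illustrative (with Ignite you would obtain the queue via ignite.queue("frontier", CAPACITY, new CollectionConfiguration())).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class FrontierLoop {
    // Stand-in for IgniteQueue<String>; IgniteQueue implements BlockingQueue,
    // so the same loop works against a real distributed queue.
    static final BlockingQueue<String> frontier = new LinkedBlockingQueue<>(10_000);

    // Hypothetical fetch step: crawls a page and returns its out-links.
    static List<String> fetch(String url) {
        List<String> out = new ArrayList<>();
        out.add(url + "/child");
        return out;
    }

    public static void main(String[] args) throws InterruptedException {
        frontier.put("http://example.com");
        // Each crawler node runs this loop: take a URL, crawl it, push out-links back.
        String url = frontier.poll(1, TimeUnit.SECONDS);
        for (String link : fetch(url)) {
            // A bounded put() blocks when the queue is full -> natural back pressure.
            frontier.put(link);
        }
        System.out.println(frontier.peek());
    }
}
```

Because the frontier is bounded, a node that discovers links faster than the cluster can drain them blocks on put(), which is the back-pressure behavior discussed later in this thread.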

I'd like to ensure that previously seen links are never crawled more than
once. The Ignite Set sounds like the right place to start, but I also
wondered if just the Cache would work here? Related to this, I wondered
whether URLs added to the system could be deterministically routed to a
node, so that the "already visited" links could be managed by a node-local
set (e.g. one backed by a ConcurrentHashMap) instead. Or maybe this
deterministic routing of Hash(URL) -> NodeX happens when URLs are taken off
the Queue? Of course, if NodeX goes away due to problems, I'd need another
node to take over processing for the same Hash(URL) values.
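A minimal sketch of the deterministic Hash(URL) -> node idea, assuming a fixed node count and a node-local visited set backed by ConcurrentHashMap; all names are illustrative. With Ignite, the equivalent cluster-wide check would be an atomic cache.putIfAbsent(url, true), and affinity-routed keys are reassigned automatically when a node leaves.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class UrlRouter {
    // Deterministic Hash(URL) -> node index; every node computes the same answer.
    static int route(String url, int nodeCount) {
        return Math.floorMod(url.hashCode(), nodeCount);
    }

    // Node-local "already visited" sets, one per node: no network round-trip
    // is needed because each URL is always checked by the same owner node.
    static final Map<Integer, Set<String>> visitedByNode = new ConcurrentHashMap<>();

    // Returns true only the first time a URL is seen by its owning node.
    static boolean markVisited(String url, int nodeCount) {
        int owner = route(url, nodeCount);
        return visitedByNode
                .computeIfAbsent(owner, n -> ConcurrentHashMap.newKeySet())
                .add(url);
    }
}
```

Note the trade-off the question hints at: the node-local set avoids a network hop per check, but if the owner node dies its set is lost unless the keys (and their visited state) are replicated, which is what a partitioned Ignite cache with backups would give you.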

I have performance concerns too: the network activity from putting to and
taking from the Queue, and the potentially large number of checks needed
for the visited URLs.

Thank you for any thoughts or information on my questions,
VM



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Implementing-a-distributed-crawler-tp6654.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.

Re: Implementing a distributed crawler

Posted by vkulichenko <va...@gmail.com>.
As I already mentioned, I would use compute grid for this. As for the back
pressure, you can limit the number of jobs executed at the same time and
define different scheduling strategies [1]. The queue just looks like an
unnecessary piece here.

[1] https://apacheignite.readme.io/docs/job-scheduling
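In Ignite the per-node job cap is configured through a collision SPI, e.g. FifoQueueCollisionSpi.setParallelJobsNumber(n). The resulting back-pressure behavior can be sketched locally with a semaphore-bounded executor; everything below is an illustrative stand-in, not the Ignite API itself.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedJobs {
    // Analogous to FifoQueueCollisionSpi.setParallelJobsNumber(4).
    static final int MAX_PARALLEL = 4;
    static final Semaphore slots = new Semaphore(MAX_PARALLEL);
    static final AtomicInteger running = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    // Blocks the submitter once MAX_PARALLEL jobs are in flight -> back pressure.
    static void submit(ExecutorService pool, Runnable job) throws InterruptedException {
        slots.acquire();
        pool.execute(() -> {
            int now = running.incrementAndGet();
            peak.accumulateAndGet(now, Math::max); // track observed concurrency
            try {
                job.run();
            } finally {
                running.decrementAndGet();
                slots.release();
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newCachedThreadPool();
        for (int i = 0; i < 20; i++) {
            submit(pool, () -> {
                try { Thread.sleep(20); } catch (InterruptedException ignored) { }
            });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("peak concurrency = " + peak.get());
    }
}
```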

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Implementing-a-distributed-crawler-tp6654p6689.html

Re: Implementing a distributed crawler

Posted by vm <go...@gmail.com>.
Hi Val, thanks for your help. I really like that the blocking queue
provides natural "back pressure", but if performance becomes a real
problem here, can you suggest an alternative approach?



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Implementing-a-distributed-crawler-tp6654p6687.html

Re: Implementing a distributed crawler

Posted by vkulichenko <va...@gmail.com>.
Hi VM,

This sounds like a good use case for Compute Grid [1]. It will allow you to
do parallel processing with proper failover and load balancing.

The issue with the distributed queue is that every operation is actually an
update of a single entry in a cache, so the queue becomes a single point of
contention. Most likely this will have a negative effect on performance.

[1] https://apacheignite.readme.io/docs/compute-grid
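The key-to-node routing behind the compute grid can be sketched with per-node executors standing in for grid nodes; ownerOf(...) and the executor array are illustrative, while real code would call ignite.compute().affinityRun(cacheName, url, job) to run the job on the node that owns the key.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AffinityDispatch {
    static final int NODES = 3;
    // One single-threaded executor stands in for each grid node.
    static final ExecutorService[] nodes = new ExecutorService[NODES];
    static {
        for (int i = 0; i < NODES; i++)
            nodes[i] = Executors.newSingleThreadExecutor();
    }
    static final List<String> handled = new CopyOnWriteArrayList<>();

    // Deterministic owner of a URL, like Ignite's affinity function for a cache key.
    static int ownerOf(String url) {
        return Math.floorMod(url.hashCode(), NODES);
    }

    // Mimics affinityRun: the job always executes on the node owning the key,
    // so work for a given URL never contends with other nodes.
    static void affinityRun(String url, Runnable job) {
        int owner = ownerOf(url);
        nodes[owner].execute(() -> {
            handled.add(owner + ":" + url); // record which "node" ran the job
            job.run();
        });
    }

    static void shutdown() throws InterruptedException {
        for (ExecutorService n : nodes) {
            n.shutdown();
            n.awaitTermination(5, TimeUnit.SECONDS);
        }
    }
}
```

Unlike the queue, there is no single shared entry here: jobs fan out across owners, and Ignite's failover (which this local sketch cannot show) re-executes a job elsewhere if its node fails.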

-Val



--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Implementing-a-distributed-crawler-tp6654p6682.html