Posted to user@ignite.apache.org by matt <go...@gmail.com> on 2018/10/09 13:36:40 UTC

Delay queue or similar?

Hi,

I'm prototyping a web crawler that uses Ignite as the crawl-db. I'd like to
ensure the crawler obeys the appropriate Crawl-delay time as set in a site's
robots.txt file. The way I have this set up now is by submitting
"candidates" to an Ignite cache. A local listener is set up to receive
successfully persisted items, and it then submits those items to a queue for
a fetcher to pull from.

Goal: Support a delay time + maximum fetch concurrency, per-host, per-item.

Put another way: "for each fetch item, ensure that requests made to the
associated host are delayed as required, and that no more than n requests
are made during each delayed run".

This could be modeled as a Map<Host,DelayQueue>, or maybe even by using a
ScheduledExecutorService where each task represents a host and is repeated
according to the delay time.
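
As a rough illustration of the DelayQueue idea, here is a minimal sketch in plain Java. It is a variant rather than the exact Map<Host,DelayQueue> shape: a single DelayQueue holds one slot per host, and a slot only becomes available once that host's delay has elapsed. The class name HostDelaySketch, the HostSlot type, and the example host names are hypothetical, and nothing here uses Ignite.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class HostDelaySketch {
    /** One slot per host; it becomes available once readyAtMillis has passed. */
    static final class HostSlot implements Delayed {
        final String host;
        final long readyAtMillis;

        HostSlot(String host, long readyAtMillis) {
            this.host = host;
            this.readyAtMillis = readyAtMillis;
        }

        @Override
        public long getDelay(TimeUnit unit) {
            long remaining = readyAtMillis - System.currentTimeMillis();
            return unit.convert(remaining, TimeUnit.MILLISECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.MILLISECONDS),
                                other.getDelay(TimeUnit.MILLISECONDS));
        }
    }

    public static void main(String[] args) throws InterruptedException {
        DelayQueue<HostSlot> ready = new DelayQueue<>();
        long now = System.currentTimeMillis();
        ready.put(new HostSlot("slow.example", now + 400));
        ready.put(new HostSlot("fast.example", now + 100));

        for (int i = 0; i < 2; i++) {
            // take() blocks until the earliest host's delay has expired.
            HostSlot slot = ready.take();
            System.out.println("fetch from " + slot.host);
            // A real crawler would re-queue the host here with a new
            // readyAtMillis of "now + that host's crawl delay".
        }
    }
}
```

Because hosts that are not yet ready never leave the queue, nothing reaches the fetcher early. The per-host concurrency limit is not covered by this sketch.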

I'd like to prevent items from being put into the Java work queue if they
are not yet ready to be fetched, and I'm slightly worried about the
potential number of hosts (in reference to the Java Map<Host,...>
data structure).

So my question is: is there something that Ignite can provide for making
this all work?

- Matt



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/

Re: Delay queue or similar?

Posted by matt <go...@gmail.com>.
Ok will try that. Cheers!
- Matt




Re: Delay queue or similar?

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

You could have a secondary (SQL) index on time, and do SELECT ... ORDER BY
time to get the most eager hosts.

For the initial time, you could use 0L as the default value, i.e. check for
null and use 0L if null.
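
To make the "index on time" idea concrete without depending on a running Ignite node, here is a minimal plain-Java sketch of the same lookup. A sorted set stands in for the secondary index, and mostEager plays the role of a query ordered by time with a limit. The class NextFetchIndex and its method names are hypothetical, and this is not the Ignite SQL API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class NextFetchIndex {
    // Orders (time, host) pairs by time first, then by host name for ties.
    private static final Comparator<Map.Entry<Long, String>> BY_TIME_THEN_HOST =
        Map.Entry.<Long, String>comparingByKey()
            .thenComparing(Map.Entry.<Long, String>comparingByValue());

    private final Map<String, Long> timeByHost = new HashMap<>();
    // Stand-in for the secondary index: always sorted by next fetch time.
    private final TreeSet<Map.Entry<Long, String>> byTime = new TreeSet<>(BY_TIME_THEN_HOST);

    /** Registers a host; an unseen host starts at 0L, so it is eligible at once. */
    public void addHost(String host) {
        if (!timeByHost.containsKey(host)) {
            setNextFetchTime(host, 0L);
        }
    }

    /** Sets or updates the earliest time at which the host may be fetched again. */
    public void setNextFetchTime(String host, long time) {
        Long old = timeByHost.put(host, time);
        if (old != null) {
            byTime.remove(Map.entry(old, host));
        }
        byTime.add(Map.entry(time, host));
    }

    /** Returns up to limit hosts whose time is at or before now, earliest first. */
    public List<String> mostEager(long now, int limit) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Long, String> entry : byTime) {
            if (entry.getKey() > now || result.size() == limit) {
                break;
            }
            result.add(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        NextFetchIndex index = new NextFetchIndex();
        index.addHost("new.example");
        index.setNextFetchTime("busy.example", System.currentTimeMillis() + 5_000);
        System.out.println(index.mostEager(System.currentTimeMillis(), 10));
    }
}
```

The point of the ordering is that finding a ready host no longer requires looping over every key: the scan stops at the first host whose time is still in the future.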

Regards,

-- 
Ilya Kasnacheev


Thu, 11 Oct 2018 at 0:20, matt <go...@gmail.com> wrote:

> Thanks for the feedback, Ilya!
>
> In your example, where would the initial "host" in "long time =
> cache.get(host);" come from? In the case I need to solve for, I would not
> know what host would be most suitable to make a request to, so would need
> to continuously loop over all available keys until the crawl is done. This may
> introduce a performance hit, if (for example) the only host that is ready
> for a request is the last one in a very large list of keys. Does that make
> sense? Apologies if I'm misunderstanding!
>
> - Matt

Re: Delay queue or similar?

Posted by matt <go...@gmail.com>.
Thanks for the feedback, Ilya!

In your example, where would the initial "host" in "long time =
cache.get(host);" come from? In the case I need to solve for, I would not
know what host would be most suitable to make a request to, so would need to
continuously loop over all available keys until the crawl is done. This may
introduce a performance hit, if (for example) the only host that is ready
for a request is the last one in a very large list of keys. Does that make
sense? Apologies if I'm misunderstanding!

- Matt




Re: Delay queue or similar?

Posted by Ilya Kasnacheev <il...@gmail.com>.
Hello!

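I think you could model it with an ATOMIC cache:

while (true) {
    // Next allowed request time for this host; a missing entry starts at 0L.
    Long time = cache.get(host);
    if (time == null) {
        cache.putIfAbsent(host, 0L);
        continue;
    }
    long now = System.currentTimeMillis();
    // replace() succeeds only if no other worker changed the value first.
    // The new time is measured from now, so a stale entry cannot allow a burst.
    if (time < now && cache.replace(host, time, now + hostDelay)) {
        // do request to host
        break;
    } else {
        // sleep or do other requests in the meantime
    }
}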
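
The compare-and-replace step in the loop above can be tried out locally without a cluster. The following is a minimal sketch that uses a ConcurrentHashMap as a stand-in for the cache, since ConcurrentMap also offers putIfAbsent(key, value) and replace(key, oldValue, newValue). The class HostClaimSketch and the method tryClaim are hypothetical names, and passing now in as a parameter is an assumption made to keep the example deterministic.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class HostClaimSketch {
    /**
     * Tries to claim the next request slot for a host.
     * Returns true if the caller may fetch now; false if the host is still
     * cooling down or another worker claimed the slot first.
     */
    static boolean tryClaim(ConcurrentMap<String, Long> nextAllowed,
                            String host, long now, long hostDelayMillis) {
        // A host seen for the first time starts at 0L, so it is ready at once.
        Long time = nextAllowed.putIfAbsent(host, 0L);
        if (time == null) {
            time = 0L;
        }
        // The replace only succeeds if the stored value is still the one we read.
        return time < now && nextAllowed.replace(host, time, now + hostDelayMillis);
    }

    public static void main(String[] args) {
        ConcurrentMap<String, Long> nextAllowed = new ConcurrentHashMap<>();
        long now = System.currentTimeMillis();
        System.out.println(tryClaim(nextAllowed, "host.example", now, 2_000));
        System.out.println(tryClaim(nextAllowed, "host.example", now + 100, 2_000));
    }
}
```

A worker that gets false back can move on to another host instead of sleeping, which matches the "do other requests in the meantime" branch.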

Regards,
-- 
Ilya Kasnacheev


Tue, 9 Oct 2018 at 16:36, matt <go...@gmail.com> wrote:

> Hi,
>
> I'm prototyping a web crawler that uses Ignite as the crawl-db. I'd like to
> ensure the crawler obeys the appropriate Crawl-delay time as set in a site's
> robots.txt file. The way I have this set up now is by submitting
> "candidates" to an Ignite cache. A local listener is set up to receive
> successfully persisted items, and it then submits those items to a queue for
> a fetcher to pull from.
>
> Goal: Support a delay time + maximum fetch concurrency, per-host, per-item.
>
> Put another way: "for each fetch item, ensure that requests made to the
> associated host are delayed as required, and that no more than n requests
> are made during each delayed run".
>
> This could be modeled as a Map<Host,DelayQueue>, or maybe even by using a
> ScheduledExecutorService where each task represents a host and is repeated
> according to the delay time.
>
> I'd like to prevent items from being put into the Java work queue if they
> are not yet ready to be fetched, and I'm slightly worried about the
> potential number of hosts (in reference to the Java Map<Host,...>
> data structure).
>
> So my question is: is there something that Ignite can provide for making
> this all work?
>
> - Matt