Posted to user@manifoldcf.apache.org by Fuad Efendi <fu...@efendi.ca> on 2011/04/04 23:46:24 UTC

How to add task to queue dynamically (WebCrawler)

Hi, I'm trying to understand how the WebCrawler or RSS connector dynamically
adds new URLs to the task list.

Is that something like:

    activities.setDocumentScheduleBounds(urlValue,
        defaultRescanTime, defaultRescanTime, null, null);

 

In my specific case, the task simply stops.

 

Thanks!

Re: How to add tast to queue dynamically (WebCrawler)

Posted by Karl Wright <da...@gmail.com>.
Hi Fuad,

The ManifoldCF agents process is NOT a web application and NOT an
externally-managed thread pool. ManifoldCF consists of the following:

- java agents process
- mcf-crawler-ui web application
- mcf-authority-service web application
- mcf-api-service web application

For the ManifoldCF quick start, which is what you run when you run the
example, there is a single process, which consists of all of these
components PLUS Jetty.  But the thread rules are obeyed nevertheless,
because the web applications deployed under Jetty still do NO
crawling.

Karl



RE: How to add task to queue dynamically (WebCrawler)

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Karl,

Yes, for politeness; the RSS and Web connectors seem extremely rich (and
hard to use as a basic sample).
Thread.sleep() is fine... but since this runs in a web container (an
externally managed thread pool) it's unsafe (especially while I don't yet
know many of the details); JEE strongly advises against thread-level
programming, and Java 6 has new concurrency features we could use...

I found an easy temporary solution: a static "lastCrawlAttempt" timestamp,
checked against the current time in the processDocuments() method body
(simply returning if the delay has not elapsed), combined with explicit
scheduling:
activities.setDocumentScheduleBounds(newUrl, rescanTime, rescanTime, null,
null);
so that processDocuments() does nothing during the specified delay.
Just a temporary workaround...
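A minimal sketch of that workaround, with the ManifoldCF-specific parts stripped out (the class name, delay constant, and method below are illustrative, not from a real connector):

```java
// Sketch of the "lastCrawlAttempt" workaround: a static timestamp checked
// at the top of processDocuments(). shouldProcess() isolates the timing
// logic; a connector would simply return when it yields false and rely on
// setDocumentScheduleBounds() to bring the document back later.
public class CrawlThrottle {
    // Minimum delay between fetches against the same host, in milliseconds.
    private static final long MIN_DELAY_MS = 2000L;

    // Time of the last crawl attempt; 0 means "never attempted".
    private static long lastCrawlAttempt = 0L;

    // Returns true (and records the attempt) if the delay has elapsed;
    // false means "do nothing this time".
    public static synchronized boolean shouldProcess(long nowMs) {
        if (nowMs - lastCrawlAttempt < MIN_DELAY_MS) {
            return false;
        }
        lastCrawlAttempt = nowMs;
        return true;
    }
}
```

Note that a static field like this is shared across all worker threads in the agents process, which is why the check is synchronized.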

-Fuad



Re: How to add task to queue dynamically (WebCrawler)

Posted by Karl Wright <da...@gmail.com>.
Hi Fuad,

Ok, so this is for politeness?
I am sure you've looked at what the RSS and Web connectors do to
enforce politeness constraints.  As you probably know, the framework
has the ability to throttle all connections using AVERAGE fetch rate
throttling (see the "Throttling" tab for the connection).  But if you
need to make sure you do not exceed a MAXIMUM rate, the standard is to
adopt logic similar to that used by the RSS and Web connectors, which
limit connection count as well as maximum fetch rate by way of
connector-based throttling.

I suppose that you may not like the Thread.sleep() you see in the
throttling code in the RSS and Web connectors.  Since these connectors
are throttling max connections as well as maximum fetch rate, it was
not possible in all cases to avoid Thread.sleep().  But I can see a
case for trying to control scheduling of documents for the purposes of
enforcing a maximum fetch rate alone.

In order for that to work, you'd need connector control over the
schedule for every way a document can be added to the job queue.  The
addDocumentReference() method is only one such case; you'd also want
similar functionality for addSeedDocuments().  I'd suggest creating a
ticket for this change to the API.  FWIW, I don't think this is a big
win for either Web or RSS crawling, since all that the Thread.sleep()
does is reduce (slightly) the number of available threads, so I'd
prioritize it accordingly.

Karl
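The scheduling-only approach to a maximum fetch rate could be sketched as follows; this FetchScheduler class is purely illustrative and not part of the ManifoldCF API:

```java
// Computes the earliest time each newly discovered document may be fetched
// so that fetches never occur closer together than minIntervalMs. Instead
// of blocking a worker thread with Thread.sleep(), a connector could pass
// the returned value as the document's scheduled processing time.
public class FetchScheduler {
    private final long minIntervalMs; // minimum spacing between fetches
    private long nextAllowedMs;       // next free fetch slot

    public FetchScheduler(long minIntervalMs, long startMs) {
        this.minIntervalMs = minIntervalMs;
        this.nextAllowedMs = startMs;
    }

    // Returns the scheduled fetch time for the next document: "now" if the
    // rate allows it, otherwise the next free slot; then reserves the slot.
    public synchronized long schedule(long nowMs) {
        long scheduled = Math.max(nowMs, nextAllowedMs);
        nextAllowedMs = scheduled + minIntervalMs;
        return scheduled;
    }
}
```

With a 2-second interval, two documents discovered at the same instant get scheduled times 2 seconds apart, and no thread ever sleeps.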



RE: How to add task to queue dynamically (WebCrawler)

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Karl,

I need to crawl a sequence of (different) URLs from the same host, where each
URL defines the next one to be crawled; I can crawl the next URL only after a
specified amount of time. The URLs are different... of course I could use
Thread.sleep() before calling
activities.addDocumentReference(newUrl), but that seems too naïve...
And this use case is very similar to a generic Web crawl (where we need to be
polite, with a 2-3 second delay before recrawling the same domain).




Re: How to add task to queue dynamically (WebCrawler)

Posted by Karl Wright <da...@gmail.com>.
If you are trying to control the schedule for the FIRST time a
document is fetched, the IProcessActivity API doesn't permit that at
this time.  You would need to add a new version of
addDocumentReference() to the IProcessActivity interface, which
allowed you to set the scheduled processing time in addition to
everything else.  The internals for such a change should be
straightforward since all the moving parts are already there.

I'm curious, however, about your use case.  It is currently unheard of
for connectors to try to control the scheduling of all documents being
fetched - this would interfere with ManifoldCF's scheduling
algorithms, which are designed for maximum throughput.  I'd like to be
sure your design makes sense before I agree that this is a reasonable
addition to the API.  Can you explain the connector and its design so
that I can see what you are trying to accomplish?

Thanks!
Karl


RE: How to add task to queue dynamically (WebCrawler)

Posted by Fuad Efendi <fu...@efendi.ca>.
Hi Karl,

So this is a "retry"... can we schedule document retrieval? I retrieve XML,
generate a new URL, and I want to schedule this new document to be retrieved
at a specific time.
-Fuad




Re: How to add task to queue dynamically (WebCrawler)

Posted by Karl Wright <da...@gmail.com>.
Actually, the method you highlight simply sets the retry timing bounds
for the document.  The one that adds documents to the queue is
addDocumentReference().

Chapter 7 of ManifoldCF in Action addresses this.


Karl
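The distinction can be made concrete with a tiny mock of the two methods; the signatures below are simplified stand-ins for illustration, not the real IProcessActivity API:

```java
import java.util.ArrayList;
import java.util.List;

// Mock of the two calls discussed here: addDocumentReference() is what
// actually puts a document on the queue; setDocumentScheduleBounds() only
// constrains when an already-queued document may be rescanned.
public class QueueSketch {
    // Stand-in for the activities handle passed into processDocuments().
    static class MockActivities {
        final List<String> queued = new ArrayList<>();
        void addDocumentReference(String url) { queued.add(url); }
        void setDocumentScheduleBounds(String url, Long lower, Long upper,
                Long a, Long b) { /* adjusts timing bounds only */ }
    }

    // The pattern from this thread: a connector discovers a new URL while
    // processing a document, queues it, then bounds its rescan window.
    static List<String> discover(MockActivities activities, String newUrl) {
        activities.addDocumentReference(newUrl);
        activities.setDocumentScheduleBounds(newUrl, 60000L, 60000L, null, null);
        return activities.queued;
    }
}
```

Calling only setDocumentScheduleBounds() on a URL never adds it to the queue, which is why a crawl that relies on it alone "simply stops".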


