Posted to user@manifoldcf.apache.org by ritika jain <ri...@gmail.com> on 2020/07/06 12:52:13 UTC

WebCrawler Connector code

Hi All,

I am confused about the WebCrawler connector code. My requirement is to
abort a job whenever a site corresponding to a seed is down or returns a
5xx response code.
To do this, I call the jobManager errorAbort method from the
addSeedDocuments method of WebcrawlerConnector.java, and I use the
JobStatus class to get the job ID.

My difficulty is getting all the seeds that correspond to a given job
ID. For that I used the getAllSeeds() method declared in the IJobManager
interface.

The problem is that getAllSeeds always returns a zero-length array. I
suspect that the method has no corresponding definition in its
implementation class.
*Why has this method not been implemented in the implementation class
JobManager?*

*The code is:*
    String[] array1 = jobManager.getAllSeeds(Long.parseLong(jsr[k].getJobID()));
array1 is always an empty array.

*My other question concerns this method:*
    public String addSeedDocuments(ISeedingActivity activities, Specification spec,
        String lastSeedVersion, long seedTime, int jobMode)
        throws ManifoldCFException, ServiceInterruption

The activities object holds the job ID of the job that is calling
addSeedDocuments, but neither the interface nor its implementation class
has a getter method for retrieving the job ID (it is only set in the
constructor).


Can anybody please guide me on this?

Thanks
Ritika

Re: WebCrawler Connector code

Posted by Karl Wright <da...@gmail.com>.

Hi Ritika,

For performance reasons, you do not want to load the list of seeds every
time a document is processed.  The connector API does not support
accessing arbitrary job data, in part for this reason.

You should NEVER call JobManager methods from a connector either.
You have *Activity methods that you can call instead.
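
The design described above can be illustrated with a minimal, self-contained
sketch. It is not ManifoldCF code: SeedingActivitySketch,
FrameworkSeedingActivitySketch, SeedingConnectorSketch, and the queue format
are all hypothetical stand-ins, chosen only to show why a connector that is
handed a narrow activity object has no way (and no need) to read the job ID.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical narrow interface handed to the connector during seeding. */
interface SeedingActivitySketch {
    void addSeedDocument(String documentIdentifier);
}

/** Hypothetical framework-side implementation: it knows the job, the connector does not. */
class FrameworkSeedingActivitySketch implements SeedingActivitySketch {
    private final long jobId; // set in the constructor only; deliberately no getter
    private final List<String> queue = new ArrayList<>();

    FrameworkSeedingActivitySketch(long jobId) {
        this.jobId = jobId;
    }

    @Override
    public void addSeedDocument(String documentIdentifier) {
        // The framework, not the connector, associates the document with the job.
        queue.add("job " + jobId + " <- " + documentIdentifier);
    }

    /** Framework-only view of the queue; not part of the connector-facing interface. */
    List<String> queuedEntries() {
        return queue;
    }
}

/** Hypothetical connector: it sees only the narrow interface. */
class SeedingConnectorSketch {
    void addSeedDocuments(SeedingActivitySketch activities, List<String> seedUrls) {
        for (String url : seedUrls) {
            activities.addSeedDocument(url);
        }
    }
}

class ActivityDemo {
    public static void main(String[] args) {
        FrameworkSeedingActivitySketch activities = new FrameworkSeedingActivitySketch(42L);
        new SeedingConnectorSketch().addSeedDocuments(
            activities, Arrays.asList("http://example.com/", "http://example.org/"));
        activities.queuedEntries().forEach(System.out::println);
    }
}
```

In this sketch the connector can only add seeds; everything that depends on
the job ID stays on the framework side of the interface.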

Karl


On Tue, Jul 7, 2020 at 4:04 AM ritika jain <ri...@gmail.com> wrote:


Re: WebCrawler Connector code

Posted by ritika jain <ri...@gmail.com>.

Hi  Karl,

Many thanks for your response!

The problem I faced was getting the current job ID, which is why I used
the JobStatus class. The other problem is getting the seeds that
correspond to the running job ID.

The activities object has the job ID set in its constructor, but there is
no way to get the value in WebcrawlerConnector.java because no getter is
defined.

Another thing is that JobManager has a method getAllSeeds which is
declared in its interface IJobManager but not defined in its
implementation class JobManager, so it always returns an empty value.

Thanks


On Mon, Jul 6, 2020 at 6:44 PM Karl Wright <da...@gmail.com> wrote:


Re: WebCrawler Connector code

Posted by Karl Wright <da...@gmail.com>.

Hi Ritika,

' My requirement is to abort a job whenever a seed-corresponding site is
down or returning some 5xx response codes. '

(1) Connector methods, like addSeedDocuments(), are called by the
framework.  You do not call them yourself when you write a connector.  So
you are looking in the wrong place here.
(2) All that addSeedDocuments does in the web connector is add seed URLs to
the queue for the job.  You do not want to change this implementation.
(3) The only time the web connector fetches anything is when it is
processing documents, in the processDocuments() method.
(4) You don't get to control the queue.  Documents are processed by the
framework in the order *it* determines they should be processed.  You can
create an "event" which must be satisfied before processing can occur but
that is all the control you get at the connector level.
(5) Similarly, you don't get told which document URLs are seeds.  This
information is in the job, and it is included in the job queue "isSeed"
field for each document, but it is never sent to any connector method.

It is therefore possible to add "isSeed" to the IRepositoryConnector
processDocuments() method, which will change the contract for all
connectors.  You might be able to prevent carnage by creating a
BaseRepositoryConnector method implementation and abstract method that
would provide a shim for most connectors.
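
The shim idea in the paragraph above could look roughly like the following.
This is a minimal, self-contained sketch and not ManifoldCF code: every type
here (RepositoryConnectorSketch, BaseConnectorSketch, JobAbortSketch, and the
two connectors) is a hypothetical stand-in, the per-document method is a
simplification of the real processDocuments() signature, and the HTTP fetch is
replaced by a function that returns a status code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

/** Hypothetical signal that asks the framework to abort the job. */
class JobAbortSketch extends Exception {
    JobAbortSketch(String message) { super(message); }
}

/** Hypothetical new contract: the framework passes the queue's isSeed flag. */
interface RepositoryConnectorSketch {
    void processDocument(String url, boolean isSeed) throws JobAbortSketch;
}

/** The shim: a default implementation plus an abstract old-style method. */
abstract class BaseConnectorSketch implements RepositoryConnectorSketch {
    @Override
    public void processDocument(String url, boolean isSeed) throws JobAbortSketch {
        processDocument(url); // existing connectors simply ignore the flag
    }

    protected abstract void processDocument(String url) throws JobAbortSketch;
}

/** An existing connector: unchanged, still implements only the old method. */
class LegacyConnectorSketch extends BaseConnectorSketch {
    final List<String> processed = new ArrayList<>();

    @Override
    protected void processDocument(String url) {
        processed.add(url);
    }
}

/** A web connector that opts in to the flag and aborts when a seed returns 5xx. */
class SeedAwareWebConnectorSketch extends BaseConnectorSketch {
    private final ToIntFunction<String> fetchStatus; // stand-in for the real HTTP fetch
    final List<String> indexed = new ArrayList<>();

    SeedAwareWebConnectorSketch(ToIntFunction<String> fetchStatus) {
        this.fetchStatus = fetchStatus;
    }

    @Override
    public void processDocument(String url, boolean isSeed) throws JobAbortSketch {
        int status = fetchStatus.applyAsInt(url);
        if (status >= 500 && status <= 599) {
            if (isSeed) {
                throw new JobAbortSketch("Seed " + url + " returned HTTP " + status);
            }
            return; // non-seed failure: skip the document, keep the job running
        }
        indexed.add(url);
    }

    @Override
    protected void processDocument(String url) throws JobAbortSketch {
        processDocument(url, false); // old entry point: treat as a non-seed
    }
}

class ShimDemo {
    public static void main(String[] args) {
        SeedAwareWebConnectorSketch web = new SeedAwareWebConnectorSketch(url -> 503);
        try {
            web.processDocument("http://example.com/", true);
            System.out.println("job continues");
        } catch (JobAbortSketch e) {
            System.out.println("abort job: " + e.getMessage());
        }
    }
}
```

Only the connector that cares about seeds overrides the two-argument method;
connectors like LegacyConnectorSketch keep compiling without changes, which is
the point of the shim.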

Karl
