Posted to user@manifoldcf.apache.org by lalit jangra <la...@gmail.com> on 2014/09/16 06:54:56 UTC

Reconciliation of documents crawled

Greetings,

As part of my implementation, I need to put a reconciliation mechanism in
place that verifies how many documents have been crawled for a job, so
that the count can be reported in the logs.

My first thought was to put counters in the connector code, e.g. in the
CMIS connector's addSeedDocuments() and processDocuments() methods, and
increment them as the crawl progresses. However, as far as I can see,
CmisRepositoryConnector.java is invoked separately for each seeded
document to be ingested, so these counters are not accurate. Is there any
way I can persist these counters within the code itself? I do not want to
persist them in the file system.
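
For what it is worth, the naive counter approach I had in mind would look
something like the sketch below (the class and method names are my own
invention, not part of the CMIS connector); being plain in-memory state, it
is local to one JVM and is lost on restart:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-memory crawl counters, keyed by job ID.
    State is local to one agent process and lost on restart. */
public class CrawlCounters {
  private static final ConcurrentHashMap<String, AtomicLong> seeded =
      new ConcurrentHashMap<>();
  private static final ConcurrentHashMap<String, AtomicLong> processed =
      new ConcurrentHashMap<>();

  public static long incrementSeeded(String jobId) {
    return seeded.computeIfAbsent(jobId, k -> new AtomicLong()).incrementAndGet();
  }

  public static long incrementProcessed(String jobId) {
    return processed.computeIfAbsent(jobId, k -> new AtomicLong()).incrementAndGet();
  }

  public static long seededCount(String jobId) {
    AtomicLong c = seeded.get(jobId);
    return c == null ? 0L : c.get();
  }

  public static long processedCount(String jobId) {
    AtomicLong c = processed.get(jobId);
    return c == null ? 0L : c.get();
  }
}
```

Even if the connector incremented these counters from addSeedDocuments()
and processDocuments(), each agent process would hold its own copy, so the
totals could not be reconciled across processes.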

Please suggest.
-- 
Regards,
Lalit.

Re: Reconciliation of documents crawled

Posted by Karl Wright <da...@gmail.com>.
If you are going to write to a file, you might as well write to the log
file, since that mechanism is already available.

Karl

Re: Reconciliation of documents crawled

Posted by lalit jangra <la...@gmail.com>.
Thanks Karl,

Of the three methods you suggested, I believe writing to a file would be
the easiest; correct me if I am wrong.

My initial thought was that, while the job is running, I would write out
counter values for each document seeded and processed, since the
addSeedDocuments() and processDocuments() methods are called for each
document. In that case it would not be easy to reconcile after the job is
complete, because I have a large amount of data once the job finishes and
mapping the entries would be difficult. This is why I am trying to avoid a
file-based mechanism. I would also hit the tracking issue, because the
connector object is called multiple times and multiple agents run in
parallel.

Please suggest.

Regards.



-- 
Regards,
Lalit.

Re: Reconciliation of documents crawled

Posted by Karl Wright <da...@gmail.com>.
Hi Lalit,

So, let me clarify: you want some independent measure as to whether every
document seeded, per job, has been in fact processed?

If that is a correct statement, there is by definition no "in code" way to
do it, since there are multiple agents running in your setup. Each agent
may process some of the documents, and certainly no agent will process all
of them.  Also, restarting any agent process will lose the information you
are attempting to record.

So you are stuck with three possibilities:

The first possibility is to use [INFO] statements written to the log.  This
would work, but you don't have the information you need in your connector
(specifically the job ID), so you would have to add these logging
statements to various places in the ManifoldCF framework.
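
A sketch of what such a statement might carry (the field layout here is
invented for illustration; ManifoldCF has no such log format out of the
box):

```java
import java.util.logging.Logger;

/** Hypothetical helper that formats a per-document crawl event as a
    single INFO log line. The field names (job, event, doc) are
    illustrative only, not an existing ManifoldCF log format. */
public class CrawlEventLog {
  private static final Logger LOG = Logger.getLogger("crawl.reconciliation");

  public static String format(String jobId, String event, String docId) {
    return String.format("job=%s event=%s doc=%s", jobId, event, docId);
  }

  public static void log(String jobId, String event, String docId) {
    LOG.info(format(jobId, event, docId));
  }
}
```

Reconciliation then becomes counting matching lines per job in the log
file, e.g. comparing the number of `event=SEED` lines against the number
of `event=PROCESS` lines for a given `job=` value.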

The second possibility is to make use of the history database table, where
events are recorded.  You could create two new activity types, also written
within the framework, for tracking seeding of records and for tracking
processing of records.  There are already activity types for job start and
end.
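
Roughly, the connector-side recording might look like the following; the
interface below is only a stand-in that mirrors the general shape of the
framework's activity-recording API, and the activity-type names are
invented for this example:

```java
/** Stand-in mirroring the general shape of an activity-recording
    interface; not the exact ManifoldCF signature. */
interface ActivityRecorder {
  void recordActivity(long startTime, String activityType, String entityId,
                      String resultCode, String resultDescription);
}

/** Invented activity-type names for tracking seeding and processing. */
public class ReconciliationActivities {
  public static final String ACTIVITY_SEED = "document seed";
  public static final String ACTIVITY_PROCESS = "document process";

  public static void recordSeed(ActivityRecorder activities, String docId) {
    activities.recordActivity(System.currentTimeMillis(), ACTIVITY_SEED,
        docId, "OK", null);
  }

  public static void recordProcess(ActivityRecorder activities, String docId) {
    activities.recordActivity(System.currentTimeMillis(), ACTIVITY_PROCESS,
        docId, "OK", null);
  }
}
```

With events recorded this way, reconciliation is a query against the
history table grouping by job and activity type.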

Finally, the third possibility: If you must absolutely avoid the file
system, you would have to write a tracking process which allowed ManifoldCF
threads to connect via sockets and communicate document seeding and
processing events.  Once again, within the framework, you would transmit
events to the recording process.  This system would be at risk of losing
tracking data when your tracking process needed to be restarted, however.
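
A minimal sketch of such a tracking process, assuming a one-line-per-event
TCP protocol (the wire format and class names are invented for
illustration):

```java
import java.io.*;
import java.net.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Sketch of a standalone tracking process: crawler threads connect over
    TCP and send one "jobId<space>eventType" line per event. State is
    still lost if this process restarts, as noted above. */
public class TrackingServer {
  private final ConcurrentHashMap<String, AtomicLong> counts =
      new ConcurrentHashMap<>();
  private ServerSocket server;

  /** Starts listening on an ephemeral port; returns the port number. */
  public int start() throws IOException {
    server = new ServerSocket(0);
    Thread t = new Thread(() -> {
      try {
        while (true) {
          Socket s = server.accept();
          try (BufferedReader in = new BufferedReader(
                   new InputStreamReader(s.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null)
              counts.computeIfAbsent(line, k -> new AtomicLong()).incrementAndGet();
          }
        }
      } catch (IOException ignored) { /* server socket closed */ }
    });
    t.setDaemon(true);
    t.start();
    return server.getLocalPort();
  }

  public long count(String jobId, String eventType) {
    AtomicLong c = counts.get(jobId + " " + eventType);
    return c == null ? 0L : c.get();
  }

  /** Crawler-side helper: report one event and close the connection. */
  public static void send(String host, int port, String jobId, String eventType)
      throws IOException {
    try (Socket s = new Socket(host, port);
         Writer out = new OutputStreamWriter(s.getOutputStream(), "UTF-8")) {
      out.write(jobId + " " + eventType + "\n");
    }
  }
}
```

The crawler side would call send(...) once per seed or process event;
querying count(jobId, eventType) afterwards yields the totals, but only
for the lifetime of the tracking process.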

None of these are trivial to implement.  Essentially, keeping track of
documents is what MCF uses the database for in the first place, so this
requirement is like insisting that there be a second ManifoldCF there to be
sure that the first one did the right thing.  It's an incredible waste of
resources, frankly.  Using the log is perhaps the simplest to implement and
most consistent with what clients might be expecting, but it has very
significant I/O costs.  Using the history table has a similar problem,
while also putting your database under load.  The last solution requires a
lot of well-constructed code and remains vulnerable to system instability.
Take your pick.

Karl

