Posted to user@manifoldcf.apache.org by Shinichiro Abe <sh...@gmail.com> on 2011/06/17 01:09:50 UTC

Resume mechanism

Hi.
Please tell me about the resume mechanism.

For example, suppose the following things happen while a job is executing:
the MCF services stop, Solr shuts down, or the repository servers shut down.
Because of the shutdown, the job cannot reach its connectors, so it stops ingesting documents.
But once these are recovered, the job resumes ingesting and the crawl stays consistent.
What manages this? Does jobqueue manage the resume mechanism?

If so, are there cases in which a job cannot keep the crawl consistent?
For example:
 a) PostgreSQL stops before all documents are inserted into jobqueue, so the jobqueue data is incomplete and inconsistent.
 b) Many documents need to be crawled, but MCF stops before all of them are inserted into jobqueue, so the jobqueue data is incomplete and inconsistent.
 c) Any other cases.
I want to know whether data can become inconsistent when a crawl is interrupted partway through.

I would also like to read Part 4 (MCF architecture) of ManifoldCF in Action.
Regards,
Shinichiro Abe

Re: Resume mechanism

Posted by Shinichiro Abe <sh...@gmail.com>.

Thank you for your reply. I understand now.
I realize that my test scenario was unrealistic.

Shinichiro Abe

On 2011/06/22, at 21:28, Karl Wright wrote:

> It also occurs to me that a good way to test this is to write an
> IDBInterface implementation that wraps another implementation, which
> generates errors when you want it to.  Then you can verify that the
> system behaves as designed under the conditions of database errors
> occurring.
> 
> Karl

Re: Resume mechanism

Posted by Karl Wright <da...@gmail.com>.

It also occurs to me that a good way to test this is to write an
IDBInterface implementation that wraps another implementation, which
generates errors when you want it to.  Then you can verify that the
system behaves as designed under the conditions of database errors
occurring.
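
A minimal sketch of such a wrapper follows. It is an illustration only, not ManifoldCF code: SimpleDb is a hypothetical two-method interface standing in for the real IDBInterface (whose methods are not reproduced here), and the names FaultInjectionSketch, InMemoryDb, wrap, and shouldFail are likewise assumptions. A dynamic proxy is used so that the same wrapper works for any interface; a real wrapper would throw whatever exception type the wrapped interface declares rather than IllegalStateException.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class FaultInjectionSketch {

  /** Hypothetical stand-in for a database interface; NOT the real IDBInterface. */
  public interface SimpleDb {
    void insert(String table, String row);
    int count(String table);
  }

  /** In-memory implementation that plays the role of the wrapped database. */
  public static class InMemoryDb implements SimpleDb {
    private final List<String> rows = new ArrayList<>();

    @Override
    public void insert(String table, String row) {
      rows.add(table + ":" + row);
    }

    @Override
    public int count(String table) {
      int n = 0;
      for (String r : rows) {
        if (r.startsWith(table + ":")) {
          n++;
        }
      }
      return n;
    }
  }

  /**
   * Wraps delegate in a proxy that throws whenever shouldFail accepts the
   * name of the method being called, and otherwise forwards the call.
   */
  public static <T> T wrap(Class<T> iface, T delegate, Predicate<String> shouldFail) {
    InvocationHandler handler = (proxy, method, args) -> {
      if (shouldFail.test(method.getName())) {
        // Fail before touching the delegate, as if the database were unreachable.
        throw new IllegalStateException("injected database error in " + method.getName());
      }
      try {
        return method.invoke(delegate, args);
      } catch (InvocationTargetException e) {
        throw e.getCause(); // rethrow the delegate's own exception unchanged
      }
    };
    return iface.cast(
        Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler));
  }
}
```

A test would call wrap(SimpleDb.class, new InMemoryDb(), name -> name.equals("insert")), run the code under test against the returned object, and then check that nothing was left half-written after the injected failure.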

Karl

On Wed, Jun 22, 2011 at 5:10 AM, Karl Wright <da...@gmail.com> wrote:
> So, to paraphrase, you are concerned about whether the state of the
> outside world is consistent with the state of ManifoldCF's database in
> all situations?
>
> If you deliberately corrupt the database manually, you are not
> simulating a realistic situation, because ManifoldCF is very careful
> to make sure that the rows in the database match the state of the
> outside world.  This is done by ensuring that the order in which
> ManifoldCF updates its database is always conservative and in the
> correct order.  What you effectively did in your test was lobotomize
> it by removing a chunk of its memory, but this could not have happened
> without your direct manipulation, even with poor communication to the
> database.  In some cases ManifoldCF relies on the fact that the job will
> be rerun in order for there to be a complete crawl, but it should
> never lose track of what it is doing.
>
> For example, take the situation where ManifoldCF discovers a document,
> then fetches it, then indexes it.  The document is entered in the
> jobqueue upon discovery, which is an atomic operation that either
> succeeds or fails.  If this fails, the job is aborted but the parent
> document is not updated as having been processed either, so it will be
> retried on the next job run.  The indexing also must complete before
> the state of the document in the jobqueue is altered, and thus the
> document will be retried if the indexing fails.  The maintenance of
> the ingeststatus table uses a two-phase commit as well to be sure that
> the status of the document in the index is accurately maintained in
> the table regardless of target system problems or database issues.
>
> Karl

Re: Resume mechanism

Posted by Karl Wright <da...@gmail.com>.

So, to paraphrase, you are concerned about whether the state of the
outside world is consistent with the state of ManifoldCF's database in
all situations?

If you deliberately corrupt the database manually, you are not
simulating a realistic situation, because ManifoldCF is very careful
to make sure that the rows in the database match the state of the
outside world.  This is done by ensuring that the order in which
ManifoldCF updates its database is always conservative and in the
correct order.  What you effectively did in your test was lobotomize
it by removing a chunk of its memory, but this could not have happened
without your direct manipulation, even with poor communication to the
database.  In some cases ManifoldCF relies on the fact that the job will
be rerun in order for there to be a complete crawl, but it should
never lose track of what it is doing.

For example, take the situation where ManifoldCF discovers a document,
then fetches it, then indexes it.  The document is entered in the
jobqueue upon discovery, which is an atomic operation that either
succeeds or fails.  If this fails, the job is aborted but the parent
document is not updated as having been processed either, so it will be
retried on the next job run.  The indexing also must complete before
the state of the document in the jobqueue is altered, and thus the
document will be retried if the indexing fails.  The maintenance of
the ingeststatus table uses a two-phase commit as well to be sure that
the status of the document in the index is accurately maintained in
the table regardless of target system problems or database issues.
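
The ordering rule described above (perform the external operation first, and change the queue state only after it succeeds) can be sketched as follows. This is a simplified illustration under stated assumptions, not ManifoldCF's implementation: the in-memory map stands in for the jobqueue table, the two-value State enum is invented for the example, and the names OrderedUpdateSketch, discover, and runOnce are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;

public class OrderedUpdateSketch {

  /** Invented two-state model; the real jobqueue has more states. */
  public enum State { PENDING, COMPLETE }

  /** Stand-in for the jobqueue table: document id -> state. */
  public final Map<String, State> queue = new LinkedHashMap<>();

  /** Records a discovered document; a document already present is left alone. */
  public void discover(String docId) {
    queue.putIfAbsent(docId, State.PENDING);
  }

  /**
   * Makes one pass over the pending documents and returns how many completed.
   * The indexer is called first; the state is changed only if it returns normally.
   */
  public int runOnce(Consumer<String> indexer) {
    int completed = 0;
    for (Map.Entry<String, State> entry : queue.entrySet()) {
      if (entry.getValue() != State.PENDING) {
        continue;
      }
      try {
        indexer.accept(entry.getKey()); // step 1: external side effect
      } catch (RuntimeException e) {
        continue; // state untouched, so the document is retried on the next pass
      }
      entry.setValue(State.COMPLETE); // step 2: record success afterwards
      completed++;
    }
    return completed;
  }
}
```

Because a failed indexing call leaves the entry in PENDING, a later pass (or a later job run) picks the document up again; the worst case is that a document is indexed twice, never that it is marked done without having been indexed.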

Karl

On Wed, Jun 22, 2011 at 2:34 AM, Shinichiro Abe
<sh...@gmail.com> wrote:
> Hi.
>
> I understand MCF's resilience.
> However, is it possible for a) and b) to occur?
> I ran the following test.
> I stopped MCF just as it was starting to crawl.
> Then I manually deleted half of the rows in the jobqueue table.
> Then I restarted MCF, and it began to crawl.
> As a result, a few documents were neither inserted into the jobqueue table nor ingested into Solr.
> It seemed that the job failed to insert the documents into the queue.
> Although this test was a deliberate case, I want to check situations in which the jobqueue data is incomplete and inconsistent.
> For example, if PostgreSQL stops suddenly and cannot be reached, could this situation happen?
> Or does JobManager handle it? If you know, please let me know.
>
> Thank you,
> Shinichiro Abe

Re: Resume mechanism

Posted by Shinichiro Abe <sh...@gmail.com>.

Hi.

I understand MCF's resilience.
However, is it possible for a) and b) to occur?
I ran the following test.
I stopped MCF just as it was starting to crawl.
Then I manually deleted half of the rows in the jobqueue table.
Then I restarted MCF, and it began to crawl.
As a result, a few documents were neither inserted into the jobqueue table nor ingested into Solr.
It seemed that the job failed to insert the documents into the queue.
Although this test was a deliberate case, I want to check situations in which the jobqueue data is incomplete and inconsistent.
For example, if PostgreSQL stops suddenly and cannot be reached, could this situation happen?
Or does JobManager handle it? If you know, please let me know.

Thank you,
Shinichiro Abe

On 2011/06/17, at 10:22, Karl Wright wrote:

> Hi Shinichiro,
> 
> All of ManifoldCF's state information is in the database, which
> maintains consistency because it is ACID.  You can stop the ManifoldCF
> agents process and start it up again, and the crawl will begin where
> it stopped.  The framework has been very carefully designed to not get
> confused in any way when this is done.  This resilience is in fact one
> of the primary design criteria of ManifoldCF.
> 
> Exactly how crawls are done is covered in ManifoldCF in Action,
> chapters 11 and 12.  I'll send those to you privately.
> 
> Thanks,
> Karl

Re: Resume mechanism

Posted by Karl Wright <da...@gmail.com>.

Hi Shinichiro,

All of ManifoldCF's state information is in the database, which
maintains consistency because it is ACID.  You can stop the ManifoldCF
agents process and start it up again, and the crawl will begin where
it stopped.  The framework has been very carefully designed to not get
confused in any way when this is done.  This resilience is in fact one
of the primary design criteria of ManifoldCF.

Exactly how crawls are done is covered in ManifoldCF in Action,
chapters 11 and 12.  I'll send those to you privately.

Thanks,
Karl

On Thu, Jun 16, 2011 at 7:09 PM, Shinichiro Abe
<sh...@gmail.com> wrote:
> Hi.
> Please tell me about the resume mechanism.
>
> For example, suppose the following things happen while a job is executing:
> the MCF services stop, Solr shuts down, or the repository servers shut down.
> Because of the shutdown, the job cannot reach its connectors, so it stops ingesting documents.
> But once these are recovered, the job resumes ingesting and the crawl stays consistent.
> What manages this? Does jobqueue manage the resume mechanism?
>
> If so, are there cases in which a job cannot keep the crawl consistent?
> For example:
>  a) PostgreSQL stops before all documents are inserted into jobqueue, so the jobqueue data is incomplete and inconsistent.
>  b) Many documents need to be crawled, but MCF stops before all of them are inserted into jobqueue, so the jobqueue data is incomplete and inconsistent.
>  c) Any other cases.
> I want to know whether data can become inconsistent when a crawl is interrupted partway through.
>
> I would also like to read Part 4 (MCF architecture) of ManifoldCF in Action.
> Regards,
> Shinichiro Abe