Posted to user@manifoldcf.apache.org by "Delapasse, Deanna" <dd...@oceaneering.com> on 2015/06/18 14:31:20 UTC

Aborting job doesn't clear the queue

I'm running a very simple CMIS->ElasticSearch job. I noticed that after I
abort and restart a fairly large job (2800 rows), I start getting:

[SyncThread:0] WARN org.apache.zookeeper.server.persistence.FileTxnLog -
fsync-ing the write ahead log in SyncThread:0 took 1077ms which will
adversely effect operation latency. See the ZooKeeper troubleshooting guide

When I abort the job, the status goes to 'Aborted' for a few minutes and
then finally to 'Done'.  The job's row shows Documents=2847, Active=692,
Processed=2255.
I poked around in the database: there are still 2847 rows in the jobqueue
table and 2155 rows in ingeststatus.

Should I just write a script to 'purge' these tables when I need to
restart?  Are there other tables I should also check/clear?  I'm using
PostgreSQL with ZooKeeper.  There is nothing in the ManifoldCF log file.
The ZooKeeper warnings only show up in the console where I started
ManifoldCF.

Thanks!
Deanna

Re: Aborting job doesn't clear the queue

Posted by Karl Wright <da...@gmail.com>.
Hi Deanna,

MCF is an incremental crawler where jobs are meant to be run repeatedly in
order to synchronize the repository with the output.  It therefore keeps
information between job runs in order to be able to do the minimum work
(and also to know what to delete) on subsequent job runs.
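
As a rough illustration of why that retained state matters, here is a
minimal Python sketch of an incremental sync. It is not ManifoldCF's
actual code; the function name sync and the dictionaries repository,
output, and state are hypothetical stand-ins for the crawled repository,
the search index, and the bookkeeping kept between job runs.

```python
def sync(repository, output, state):
    """Run one incremental pass and return (indexed, deleted) counts.

    repository: dict of doc_id -> version currently in the source
    output:     dict of doc_id -> version currently in the index
    state:      dict of doc_id -> version recorded by earlier runs
    """
    indexed = 0
    seen = set()
    for doc_id, version in repository.items():
        seen.add(doc_id)
        # Only new or changed documents are sent to the output.
        if state.get(doc_id) != version:
            output[doc_id] = version
            state[doc_id] = version
            indexed += 1

    deleted = 0
    # Anything remembered from an earlier run but no longer in the
    # repository must be removed from the output as well.
    for doc_id in list(state):
        if doc_id not in seen:
            output.pop(doc_id, None)
            del state[doc_id]
            deleted += 1
    return indexed, deleted
```

If state were wiped between runs, every run would re-index everything and
would have no record of which documents to delete from the output.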

What you are seeing is perfectly normal and expected.

Thanks,
Karl