Posted to user@manifoldcf.apache.org by "小林 茂樹 (Shigeki Kobayashi; Information Systems Division / Service Planning Department)" <sh...@g.softbank.co.jp> on 2012/03/16 07:00:54 UTC

[ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions

I was crawling web sites that link to HTML and PDF files, using the provided
multiprocess-example agent. After a few hours, Simple History started
showing result code -104 with the message "Interrupted: Job no longer
active".

After the same error had occurred about 40 times, the job status became
"Aborting", and the job finally ended with "Error: Repeated service
interruptions - failure processing document: Ingestion HTTP error code 500".

The job was interrupted and stopped.

Does anyone know what causes "Repeated service interruptions" and makes
jobs stop?
Also, under what circumstances does result code -104 occur, and what does
it mean?

If you have any ideas, please advise me on how to avoid this error.


I am using the following:

Solr 1.4 (Extracting Request Handler is set)
ManifoldCF 0.4 (multiprocess-example)
- Repository connector: WEB
- Output connector: Solr
Tomcat 6.0.29
PostgreSQL 9.1.3


Here is MCF’s debug log right before the job was interrupted:

DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (95697 ms)
DEBUG 2012-03-15 20:04:16,325 (Worker thread '4') - WEB: Waiting 3895 ms before starting fetch on http://xx.xx.xx.xx:80
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Attempting to get connection to http://xx.xx.xx.xx:80 (99593 ms)
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Successfully got connection to http://xx.xx.xx.xx:80 (99593 ms)
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Waiting for an HttpClient object
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Got an HttpClient object after 0 ms.
DEBUG 2012-03-15 20:04:20,221 (Worker thread '4') - WEB: Get method for '/xx/xx.pdf'
DEBUG 2012-03-15 20:04:20,222 (Worker thread '4') - WEB: For http://xx.xx/xx/xx.pdf, setting virtual host to xx.xx
DEBUG 2012-03-15 20:04:20,315 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 128 ms.
DEBUG 2012-03-15 20:04:20,445 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,509 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,573 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,637 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,701 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,765 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,829 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,893 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:20,957 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,021 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,085 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,149 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,213 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
DEBUG 2012-03-15 20:04:21,277 (Worker thread '4') - WEB: Performing a read wait on bin 'xx.xx' of 62 ms.
 INFO 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: FETCH URL|http://xx.xx/xx/xx.pdf|1331809460221+1122|-104|65536|org.apache.manifoldcf.core.interfaces.ManifoldCFException|Interrupted: Job no longer active
DEBUG 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: Fetch exception for 'http://xx.xx/xx/xx.pdf'
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Interrupted: Job no longer active
        at org.apache.manifoldcf.crawler.connectors.webcrawler.ThrottledFetcher$ThrottledConnection.noteInterrupted(ThrottledFetcher.java:1735)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:743)
        at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:318)
Caused by: org.apache.manifoldcf.agents.interfaces.ServiceInterruption: Job no longer active
        at org.apache.manifoldcf.crawler.system.WorkerThread$VersionActivity.checkJobStillActive(WorkerThread.java:1223)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.DataCache.addData(DataCache.java:135)
        at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.getDocumentVersions(WebcrawlerConnector.java:713)
        ... 1 more
 WARN 2012-03-15 20:04:21,345 (Worker thread '4') - Pre-ingest service interruption reported for job 1331716457096 connection 'web': Job no longer active
DEBUG 2012-03-15 20:04:23,871 (Job reset thread) - Stopped job 1331716457096
DEBUG 2012-03-15 20:04:24,236 (Job notification thread) - Found job 1331716457096 in need of notification
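
For anyone who wants to pull the result code out of a log like the one above, here is a minimal Python sketch that splits the pipe-separated "FETCH URL" entry. The field names are assumptions inferred from this one sample (the second field looks like "start+elapsed" in milliseconds and the fourth like a byte count); they are not taken from ManifoldCF documentation. The sketch also assumes the wrapped entry has been joined back into a single line.

```python
from typing import NamedTuple


class FetchRecord(NamedTuple):
    """One pipe-separated FETCH URL entry; field names are guesses."""
    url: str
    timing: str           # e.g. "1331809460221+1122" (assumed start+elapsed ms)
    result_code: int      # e.g. -104
    byte_count: int       # assumed meaning of the fourth field
    exception_class: str
    message: str


def parse_fetch_line(line: str) -> FetchRecord:
    """Split a single (already unwrapped) FETCH URL log line into fields."""
    # Everything after the marker is the pipe-separated payload.
    _, _, payload = line.partition("FETCH URL|")
    if not payload:
        raise ValueError("not a FETCH URL line")
    fields = [f.strip() for f in payload.split("|", 5)]
    if len(fields) != 6:
        raise ValueError("expected 6 fields, got %d" % len(fields))
    url, timing, code, size, exc, msg = fields
    return FetchRecord(url, timing, int(code), int(size), exc, msg)


sample = (
    " INFO 2012-03-15 20:04:21,344 (Worker thread '4') - WEB: FETCH URL|"
    "http://xx.xx/xx/xx.pdf|1331809460221+1122|-104|65536|"
    "org.apache.manifoldcf.core.interfaces.ManifoldCFException|"
    "Interrupted: Job no longer active"
)
record = parse_fetch_line(sample)
print(record.result_code, record.message)
# prints: -104 Interrupted: Job no longer active
```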

Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions

Posted by Shigeki Kobayashi <sh...@g.softbank.co.jp>.
Abe-san,

Thank you for the info.

That's a good idea. Hope I can avoid the job interruption in this way.


Regards,

Shigeki

-- 
~~~~~~~~~~~~~~~~~~~~~~~~
 SoftBank Mobile Corp.
 Information Systems Division
 System Service Business Management Department
 Service Planning Department

 Shigeki Kobayashi
 shigeki.kobayashi3@g.softbank.co.jp
~~~~~~~~~~~~~~~~~~~~~~~~

Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions

Posted by Shinichiro Abe <sh...@gmail.com>.
Hi,

Currently, MCF cannot ignore a 500 server error returned by Solr.
If you can upgrade to Solr 3.2, you can specify the ignoreTikaException parameter.
https://issues.apache.org/jira/browse/SOLR-2480
Hope that helps.

Regards, 
Shinichiro Abe
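
As a rough illustration of the suggestion above, the following Python sketch builds a request URL for Solr's extracting request handler with ignoreTikaException set. The parameter name comes from SOLR-2480; the base URL, the /update/extract handler path, and the helper name build_extract_url are assumptions for illustration and should be adapted to the actual Solr setup.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, doc_id: str,
                      ignore_tika_exception: bool = True) -> str:
    """Build a URL for the extracting handler (handler path is assumed)."""
    params = {
        # literal.id supplies the document id for the extracted document.
        "literal.id": doc_id,
        # Ask Solr to skip documents Tika cannot parse instead of failing.
        "ignoreTikaException": "true" if ignore_tika_exception else "false",
    }
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


# Hypothetical Solr location; replace with the real host, port, and core.
url = build_extract_url("http://localhost:8983/solr", "http://xx.xx/xx/xx.pdf")
print(url)
```

The same parameter could presumably also be set once as a handler default on the Solr side instead of being passed on every request.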


Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions

Posted by Shigeki Kobayashi <sh...@g.softbank.co.jp>.
Karl,


Thanks for your reply.

It seems that Tika failed to extract text from some PDF files while the
crawler was following web links. I confirmed that the Solr exceptions were
followed by Tika exceptions.

So Solr, on detecting a Tika exception, returns status code 500, and MCF
then retries the ingestion a certain number of times:

"500 from ingestion request; ingestion will be retried again later"

In the end, MCF shuts down the entire job.

I know I should upgrade Solr (including Tika) to improve document
extraction. But even the current version of Tika still fails to extract
some documents, so I feel it would make more sense for MCF to ignore
ingestion errors caused by Tika and proceed.

Have users already requested that MCF ignore document ingestion failures
caused by Tika and proceed, perhaps for the next release?

Is there any option that lets users have MCF ignore such ingestion errors
and proceed?


regards,

Shigeki


Re: [ManifoldCF] Crawling with the WEB repository connector causes Repeated service interruptions

Posted by Karl Wright <da...@gmail.com>.
Hi Shigeki,

A "service interruption" means that a connector (either a repository
connector like the web connector or an output connector like the Solr
connector) could not communicate with the configured service.

"Repeated service interruptions" means that certain URLs failed to
fetch properly even after a pattern of retries which lasted many
hours.  ManifoldCF connectors deal with such errors in one of several
ways, depending on the exact details of the error:

- ignore it and proceed
- retry periodically for some time interval, and then give up and proceed
- retry periodically for some time interval, and then shut down the job
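To make the three behaviours concrete, here is a minimal sketch in plain Java. It is not ManifoldCF code: the class, enum, and method names are hypothetical and chosen only for illustration, and the retry window is passed in as a parameter rather than taken from any real connector setting.

```java
// Hypothetical illustration only; none of these names are ManifoldCF APIs.
final class RetrySketch {
    // How a connector might classify a failure.
    enum Policy { IGNORE, RETRY_THEN_SKIP, RETRY_THEN_ABORT }

    // What the framework would then do with the document or the job.
    enum Action { PROCEED, RETRY_LATER, SKIP_DOCUMENT, ABORT_JOB }

    // Decide the next step for a document that first failed at firstFailureMs.
    static Action decide(Policy policy, long firstFailureMs, long nowMs, long retryWindowMs) {
        if (policy == Policy.IGNORE) {
            return Action.PROCEED;                 // ignore it and proceed
        }
        if (nowMs - firstFailureMs < retryWindowMs) {
            return Action.RETRY_LATER;             // still inside the retry interval
        }
        // Retry interval exhausted: either give up on the document or stop the job.
        return policy == Policy.RETRY_THEN_SKIP ? Action.SKIP_DOCUMENT : Action.ABORT_JOB;
    }

    public static void main(String[] args) {
        long sixHoursMs = 6L * 60 * 60 * 1000;     // assumed window, for the demo only
        System.out.println(decide(Policy.RETRY_THEN_ABORT, 0L, sixHoursMs, sixHoursMs));
    }
}
```

In this sketch, a job that ends with "Repeated service interruptions" corresponds to the RETRY_THEN_ABORT branch once the retry window has run out.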

It sounds like your job has hit an error of the last kind.  The
"Error: Repeated service interruptions - failure processing document:
Ingestion HTTP error code 500" message indicates that the problem lies
in communication with Solr.  Apparently certain documents you are
indexing cause Solr to return HTTP status 500 ("internal server
error"), which usually means Solr threw an exception.  You will need
to find out why (the Solr log should show the exception) and take
corrective steps before your ManifoldCF job can complete successfully.
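One way to narrow this down is to send a suspect document to Solr by hand and look at the status and response body.  The sketch below is only an illustration under several assumptions: that the extracting request handler is mapped to /update/extract (this depends on solrconfig.xml), that the unique key field is named "id", and that posting the raw file body is close enough to what the Solr output connector sends.  The class name and command-line arguments are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of ManifoldCF or Solr.
final class SolrExtractProbe {
    // Build the request URL; "/update/extract" and "literal.id" are assumptions
    // about the local Solr configuration.
    static String buildExtractUrl(String solrBase, String docId) {
        try {
            return solrBase + "/update/extract?literal.id=" + URLEncoder.encode(docId, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);    // UTF-8 is always available
        }
    }

    // POST the file as the raw request body, echo Solr's reply, return the HTTP status.
    static int post(String url, Path file) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/octet-stream");
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        int status = conn.getResponseCode();
        // For 4xx/5xx responses the body is on the error stream.
        InputStream body = status >= 400 ? conn.getErrorStream() : conn.getInputStream();
        if (body != null) {
            try (InputStream in = body) {
                byte[] buf = new byte[8192];
                for (int n; (n = in.read(buf)) != -1; ) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            }
        }
        return status;
    }

    // Arguments: <solr base URL> <document id> <path to file>
    public static void main(String[] args) throws IOException {
        int status = post(buildExtractUrl(args[0], args[1]), Paths.get(args[2]));
        System.out.println();
        System.out.println("HTTP " + status);
    }
}
```

If a particular PDF reproduces the 500 this way, the response body and the Solr log for the same request are the places to look for the underlying exception.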

"Job no longer active" is harmless - it's a side effect of the job
shutting down.  When a job is shutting down, active document
processing cannot always be interrupted within a connector, but the
framework helps it to stop quickly by throwing this exception.
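The pattern is ordinary cooperative cancellation: long-running work periodically calls a check, and the check throws once the job has been marked inactive.  The following is a simplified, hypothetical sketch of that idea in plain Java.  It borrows the name checkJobStillActive from the stack trace purely for readability; it is not the actual ManifoldCF implementation, and the other names are invented.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of cooperative cancellation; not ManifoldCF code.
final class CooperativeStopSketch {
    // Unchecked here only to keep the example short.
    static final class JobInactiveException extends RuntimeException {
        JobInactiveException(String message) { super(message); }
    }

    private final AtomicBoolean jobActive = new AtomicBoolean(true);

    // Called by the framework when the job is being shut down.
    void shutDown() { jobActive.set(false); }

    // Called by the worker at safe points, e.g. between chunks of a download.
    void checkJobStillActive() {
        if (!jobActive.get()) {
            throw new JobInactiveException("Job no longer active");
        }
    }

    // Simulates copying a document in chunks. If stopBeforeChunk is in range,
    // the job is shut down just before that chunk, so the check throws.
    int copyChunks(int totalChunks, int stopBeforeChunk) {
        int copied = 0;
        for (int i = 0; i < totalChunks; i++) {
            if (i == stopBeforeChunk) {
                shutDown();                        // stands in for another thread stopping the job
            }
            checkJobStillActive();
            copied++;
        }
        return copied;
    }

    public static void main(String[] args) {
        try {
            new CooperativeStopSketch().copyChunks(5, 2);
        } catch (JobInactiveException e) {
            System.out.println("Interrupted: " + e.getMessage());
        }
    }
}
```

Seen this way, the exception in the log is the worker noticing the shutdown at its next check, not an independent failure.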

Thanks,
Karl

