Posted to dev@manifoldcf.apache.org by Maciej Liżewski <ma...@gmail.com> on 2012/09/10 15:47:41 UTC

question about error handling during indexing

Hi,

I have found a situation where Solr throws an exception because it is not
able to parse a specific file, like this:
INFO: [collection1] webapp=/solr path=/update/extract params={literal.deny_token_document=LDAPgroup:DEAD_AUTHORITY&literal.id=file://///XXXXX/YYYYmovie.mov&literal.allow_token_document=LDAPgroup:50071&literal.allow_token_document=LDAPgroup:group} {} 0 269
2012-09-10 15:34:50 org.apache.solr.common.SolrException log
SEVERE: org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.mp4.MP4Parser@48f9a4c1
        at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:230)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
        at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:240)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1656)

I can live with that, since I do not expect everything to be indexed. But I
am not sure Manifold should react the way it does: it stops indexing
anything else from that job (in fact it shuts down job execution), when it
should go on to index the other pending files. Now I must run indexing by
hand and check that everything is OK. When such a problem occurs, I add a
proper "exclude" filter and run the job again. The filter means Manifold
does not index that kind of file at all, even though the problem may lie
with one specific file only. Even then, I have no guarantee that it won't
fail in the future on some other file...

Don't you think that Manifold should try to index everything *even* when
there are problems with indexing some documents?

I am just not sure if this is a bug or a feature... :)

RE: question about error handling during indexing

Posted by Adrian Conlon <Ad...@arup.com>.
I think the preferred solution at the moment is to use the "ignoreTikaException" flag in the update/extract portion of your "solrconfig.xml" configuration.

Having used this in anger, I can confirm it does successfully allow document ingestion to continue where Tika parse errors have occurred.
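
A minimal sketch of where that flag would sit, assuming a handler definition along the lines of the one in the stock example "solrconfig.xml". The handler name is taken from the "path=/update/extract" in the log above; the "startup" and "class" attributes are assumptions about a typical setup, so merge the "bool" line into whatever definition and defaults you already have rather than copying this wholesale:

```xml
<!-- Sketch only: attribute values other than the name are assumed, not taken from this thread. -->
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <!-- Skip over Tika parse failures instead of failing the whole update request. -->
    <bool name="ignoreTikaException">true</bool>
  </lst>
</requestHandler>
```

As far as I understand it, with the flag set a document that Tika cannot parse is still indexed with whatever metadata is available (for example the "literal.*" fields sent with the request), just without extracted content. It also needs a Solr release recent enough to support the flag, so check your version's documentation.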

HTH,

Adrian


Re: question about error handling during indexing

Posted by Maciej Liżewski <ma...@gmail.com>.
OK, thanks for the explanation. ignoreTikaException should do the trick (I
will check that).


2012/9/10 Karl Wright <da...@gmail.com>

> Usually in these situations Solr returns a 500 error.  The Solr
> Connector, at one point, used to retry indefinitely when such an error
> came back, but I believe there were changes to this logic and now it
> may well abort the job if this happens for more than a few hours
> straight.  This is because the Solr connector has no way of knowing
> whether the 500 error is due to just a Tika exception on a single
> document, or something more fundamental being wrong with your Solr
> configuration.
>
> The big problem is that Solr should not be returning a 500 error just
> because Tika is unhappy with the document.  I believe there is a Solr
> ticket that describes the problem and requests different handling; you
> may be able to find it.
>
> Karl

Re: question about error handling during indexing

Posted by Karl Wright <da...@gmail.com>.
Usually in these situations Solr returns a 500 error.  The Solr
Connector, at one point, used to retry indefinitely when such an error
came back, but I believe there were changes to this logic and now it
may well abort the job if this happens for more than a few hours
straight.  This is because the Solr connector has no way of knowing
whether the 500 error is due to just a Tika exception on a single
document, or something more fundamental being wrong with your Solr
configuration.

The big problem is that Solr should not be returning a 500 error just
because Tika is unhappy with the document.  I believe there is a Solr
ticket that describes the problem and requests different handling; you
may be able to find it.

Karl

