Posted to user@manifoldcf.apache.org by Karl Wright <da...@gmail.com> on 2018/07/27 10:46:55 UTC

Tika/POI bugs

Hi all,

I've easily spent 40 hours over the last two weeks chasing down bugs in
Apache Tika and POI.  The two kinds I see are "ClassNotFound" (due to usage
of the wrong ClassLoader), and "OutOfMemoryError" (not clear what it is due
to yet).

I don't have enough time to create tickets directly in Tika for all
possible documents where these failures occur, so I urge our users to
create tickets DIRECTLY in the Tika project in Jira.  I guess you can let
the Tika people create the POI tickets, if need be.  For OutOfMemory
problems, please attach the file that causes the problem to the ticket, and
also the amount of memory you gave the agents process.  For ClassNotFound
problems, also include the stack trace.

Thanks in advance,
Karl

Re: Tika/POI bugs

Posted by Karl Wright <da...@gmail.com>.
To solve your production problem I highly recommend limiting the size of
the docs fed to Tika, for a start.  But that is no guarantee, I understand.
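
Before choosing a size limit, it can help to see which files a given cutoff would exclude. The following is only a minimal sketch in plain Java, assuming the documents are reachable on a local or mounted file system. The class name LargeFileFinder and the 50 MB default are illustrative choices, not ManifoldCF or Tika settings.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LargeFileFinder {

    /** Returns the regular files under root whose size exceeds maxBytes, sorted by path. */
    public static List<Path> findLargerThan(Path root, long maxBytes) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> sizeOf(p) > maxBytes)
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** Returns the file size, or -1 if it cannot be read (such files are never reported). */
    private static long sizeOf(Path p) {
        try {
            return Files.size(p);
        } catch (IOException e) {
            return -1L;
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical defaults: current directory and a 50 MB cutoff.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        long maxBytes = args.length > 1 ? Long.parseLong(args[1]) : 50L * 1024 * 1024;
        for (Path p : findLargerThan(root, maxBytes)) {
            System.out.println(Files.size(p) + "\t" + p);
        }
    }
}
```

The files it reports are candidates to exclude from the crawl, or to attach to a Tika ticket if one of them turns out to trigger the failure.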

Out of memory problems are very hard to get good forensics for because they
cause major disruptions to the running server.  You could turn on a degree
of logging so that you can see what documents are being processed at any
time by all threads, but that is pretty verbose.  In your properties.xml
file, add:

  <property name="org.apache.manifoldcf.crawlerthreads" value="DEBUG"/>

But I suspect that will generate far too much noise.
Still, it's the best I can offer.
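
The same idea, knowing which document was in flight when the failure hit, can also be tried outside the crawler on a set of suspect files. This is a hedged sketch only: CulpritFinder and its parser callback are hypothetical names, and the callback stands in for whatever extraction call you actually use. It is not a ManifoldCF or Tika API.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class CulpritFinder {

    /**
     * Runs the given parser on each document in turn, one at a time,
     * and returns the documents for which the parser threw.
     */
    public static List<Path> findFailing(List<Path> docs, Consumer<Path> parser) {
        List<Path> failing = new ArrayList<>();
        for (Path doc : docs) {
            // Log before parsing: if the JVM dies outright, the last START line names the suspect.
            System.err.println("START " + doc);
            try {
                parser.accept(doc);
            } catch (Throwable t) {
                // Catching OutOfMemoryError is best-effort only; the JVM may be unstable afterwards.
                System.err.println("FAILED " + doc + ": " + t);
                failing.add(doc);
            }
        }
        return failing;
    }
}
```

Running it in a separate JVM with the same heap size as the agents process keeps the conditions comparable without disturbing the running server.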

Karl


On Fri, Jul 27, 2018 at 7:52 AM msaunier <ms...@citya.com> wrote:

> Hi Karl,
>
>
>
> Okay. For the Out of Memory:
>
>
>
> This is the last day I can spend tracking down the source of the error.
> After that, I have to go into production to meet my deadlines.
>
> I hope to find time later to fix this problem on this server; otherwise I
> will not be able to index it. Unfortunately, it is very difficult to find
> the documents that cause this error. I did not find any trace in the
> database. Even in debug mode, it is difficult to identify the problematic
> document. Limiting the crawl to 1 thread might make it easier to find, but
> I'm afraid the crawl would then take very long.
>
> Do you have an idea of the best way to find this document or these
> documents?
>
>
>
> Maxence

RE: Tika/POI bugs

Posted by msaunier <ms...@citya.com>.
Hi Karl,

 

Okay. For the Out of Memory:

 

This is the last day I can spend tracking down the source of the error. After that, I have to go into production to meet my deadlines.

I hope to find time later to fix this problem on this server; otherwise I will not be able to index it. Unfortunately, it is very difficult to find the documents that cause this error. I did not find any trace in the database. Even in debug mode, it is difficult to identify the problematic document. Limiting the crawl to 1 thread might make it easier to find, but I'm afraid the crawl would then take very long.

Do you have an idea of the best way to find this document or these documents?

 

Maxence

 

From: Karl Wright [mailto:daddywri@gmail.com]
Sent: Friday, July 27, 2018 12:47
To: dev <de...@manifoldcf.apache.org>; user@manifoldcf.apache.org
Subject: Tika/POI bugs

 
