Posted to solr-user@lucene.apache.org by Brian McDowell <br...@gmail.com> on 2014/05/22 06:24:16 UTC

pdfs

Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be manually restarted. We are
using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
problem because the release notes associated with the new tika version and
also the new pdfbox indicate fixes for pdf issues. It didn't work and now
this issue is causing us to reevaluate using Solr. Any help on this matter
would be greatly appreciated. Thank you!
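
For context, PDFs usually reach Solr in a setup like this through the
ExtractingRequestHandler (Solr Cell), along these lines - the core name,
document id, and file path below are placeholders, not details from the
original post:

    curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true" \
         -F "myfile=@/path/to/document.pdf"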

Re: PDFBox hitting infinite loop - Was Re: pdfs

Posted by Erick Erickson <er...@gmail.com>.
Siegfried:

Thanks! That pretty well nails the issue as being in Tika, it's nice to
know!

Erick


On Mon, Jun 2, 2014 at 10:14 AM, Siegfried Goeschl <sg...@gmx.at> wrote:

> Hi folks,
>
> Brian was so kind as to send me the troublesome PDF document
>
> I gave it a try with PDFBox directly in order to extract the text (PDFBox
> is used by Tika to extract the textual content of a PDF document)
>
> * hitting an infinite loop with PDFBox 1.8.3
> * no problems with PDFBox 1.8.4 & 1.8.5
> * PDFBox 1.8.4 is part of Apache Tika 1.5 (see http://www.apache.org/dist/tika/CHANGES-1.5.txt)
> * Apache SOLR 4.8 uses Tika 1.5 (see https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)
>
> In short, the problem with this particular PDF is solved by
>
> * Apache PDFBox 1.8.4 onwards
> * Apache Tika 1.5
> * Apache SOLR 4.8
>
> Cheers,
>
> Siegfried Goeschl
>
>
>
> On 26.05.14 18:20, Erick Erickson wrote:
>
>> Brian:
>>
>> Yeah, if you can share the PDF that would be great. Parsing via Tika
>> should not bring down Solr, although I suppose there could be something
>> in Tika that is pathologically bad.
>>
>> You could also try using Tika itself in SolrJ and indexing from a client.
>> That might let you
>> 1> more gracefully handle this without shutting down Solr
>> 2> use different versions of Tika.
>>
>> Personally I like offloading the document parsing to clients anyway since
>> it lessens the load on the Solr server and scales much better, but YMMV.
>>
>> It's not actually very difficult, here's a skeleton (rip out the DB parts)
>> http://searchhub.org/2012/02/14/indexing-with-solrj/
>>
>> Best,
>> Erick
>>
>> On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <sg...@gmx.at>
>> wrote:
>>
>>> Sorry, typo - can you send me the PDF by email directly? :-)
>>>
>>> Siegfried Goeschl
>>>
>>> On 25 May 2014, at 10:06, Siegfried Goeschl <sg...@gmx.at> wrote:
>>>
>>>> Hi Brian,
>>>>
>>>> can you send me the email? I would like to play around :-)
>>>>
>>>> Have you opened a JIRA for PDFBox? If not, I will open one if I can
>>>> reproduce the issue …
>>>>
>>>> Thanks in advance
>>>>
>>>> Siegfried Goeschl
>>>>
>>>>
>>>> On 25 May 2014, at 04:18, Brian McDowell <br...@gmail.com> wrote:
>>>>
>>>>> Our feeding (indexing) tool halts because Solr becomes unresponsive
>>>>> after getting some really bad pdfs. There are levels of pdf "badness."
>>>>> Some just will not parse and that's fine, but others are more
>>>>> problematic in that our Operations team has to restart Solr because it
>>>>> just hangs and accepts no more documents. I actually have identified a
>>>>> pdf that will bring down Solr every time. Does anyone think that doing
>>>>> pre-validation using the pdfbox jar will work? Or, will trying to
>>>>> validate just hang as well? Any help is appreciated.
>>>>>
>>>>>
>>>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <jack@basetechnology.com> wrote:
>>>>>
>>>>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr
>>>>>> years ago. They keep fixing these issues, but they keep popping up
>>>>>> again. Sigh.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> -----Original Message----- From: Siegfried Goeschl
>>>>>> Sent: Thursday, May 22, 2014 4:35 AM
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Re: pdfs
>>>>>>
>>>>>>
>>>>>> Hi folks,
>>>>>>
>>>>>> for a small customer project I'm running SOLR with embedded Tika.
>>>>>>
>>>>>> * memory consumption is an issue but can be handled
>>>>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>>>>> excessive CPU usage - it requires a SOLR restart, but it happens only
>>>>>> once within 400,000 documents (PDF, Word, etc.) and seems a little bit
>>>>>> erratic, since I was never able to track the problem back to a
>>>>>> particular PDF document
>>>>>>
>>>>>> Having said that, we wire SOLR with Nagios to get an alarm when CPU
>>>>>> consumption goes through the roof
>>>>>>
>>>>>> If you are doing really serious stuff I would recommend
>>>>>> * moving the document extraction stuff out of SOLR
>>>>>> * providing monitoring and recovery of stuck document extractions
>>>>>> ** killing worker threads
>>>>>> ** using external processes and killing them when they spin out of control
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Siegfried Goeschl
>>>>>>
>>>>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>>>>
>>>>>>> Yeah, PDF extraction has always been at least somewhat problematic.
>>>>>>> It has improved over the years, but still not likely to be perfect.
>>>>>>>
>>>>>>> That said, I'm not aware of any specific PDF extraction issue that
>>>>>>> would bring down Solr - as opposed to causing a 500 status with an
>>>>>>> exception in PDF extraction - other than memory usage. Some PDF
>>>>>>> documents, especially those which are graphics-intensive, can require
>>>>>>> a lot of memory. The rest of Solr could be adversely affected if all
>>>>>>> available JVM heap is consumed. The solution is to give the JVM more
>>>>>>> heap space.
>>>>>>>
>>>>>>> So, what is your specific symptom?
>>>>>>>
>>>>>>> -- Jack Krupansky
>>>>>>>
>>>>>>> -----Original Message----- From: Brian McDowell
>>>>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>>>>> To: solr-user@lucene.apache.org
>>>>>>> Subject: pdfs
>>>>>>>
>>>>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing
>>>>>>> down Solr completely so that it actually needs to be manually
>>>>>>> restarted. We are using Solr 4.4 and thought that upgrading to Solr
>>>>>>> 4.8 would solve the problem because the release notes associated with
>>>>>>> the new tika version and also the new pdfbox indicate fixes for pdf
>>>>>>> issues. It didn't work and now this issue is causing us to reevaluate
>>>>>>> using Solr. Any help on this matter would be greatly appreciated.
>>>>>>> Thank you!
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>
>>>
>

PDFBox hitting infinite loop - Was Re: pdfs

Posted by Siegfried Goeschl <sg...@gmx.at>.
Hi folks,

Brian was so kind as to send me the troublesome PDF document

I gave it a try with PDFBox directly in order to extract the text
(PDFBox is used by Tika to extract the textual content of a PDF document)

* hitting an infinite loop with PDFBox 1.8.3
* no problems with PDFBox 1.8.4 & 1.8.5
* PDFBox 1.8.4 is part of Apache Tika 1.5 (see 
http://www.apache.org/dist/tika/CHANGES-1.5.txt)
* Apache SOLR 4.8 uses Tika 1.5 (see 
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika)

In short, the problem with this particular PDF is solved by

* Apache PDFBox 1.8.4 onwards
* Apache Tika 1.5
* Apache SOLR 4.8

Cheers,

Siegfried Goeschl
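
A quick way to make that comparison outside Solr is to run the suspect file
through the PDFBox command-line app directly; the jar versions and file name
below are just examples:

    # loops forever with the 1.8.3 app jar on the problem document,
    # completes normally with 1.8.4/1.8.5
    java -jar pdfbox-app-1.8.3.jar ExtractText problem.pdf out-1.8.3.txt
    java -jar pdfbox-app-1.8.5.jar ExtractText problem.pdf out-1.8.5.txt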



On 26.05.14 18:20, Erick Erickson wrote:
> Brian:
>
> Yeah, if you can share the PDF that would be great. Parsing via Tika should
> not bring down Solr, although I suppose there could be something in Tika
> that is pathologically bad.
>
> You could also try using Tika itself in SolrJ and indexing from a client. That
> might let you
> 1> more gracefully handle this without shutting down Solr
> 2> use different versions of Tika.
>
> Personally I like offloading the document parsing to clients anyway since it
> lessens the load on the Solr server and scales much better, but YMMV.
>
> It's not actually very difficult, here's a skeleton (rip out the DB parts)
> http://searchhub.org/2012/02/14/indexing-with-solrj/
>
> Best,
> Erick
>
> On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <sg...@gmx.at> wrote:
>> Sorry, typo - can you send me the PDF by email directly? :-)
>>
>> Siegfried Goeschl
>>
>> On 25 May 2014, at 10:06, Siegfried Goeschl <sg...@gmx.at> wrote:
>>
>>> Hi Brian,
>>>
>>> can you send me the email? I would like to play around :-)
>>>
>>> Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce the issue …
>>>
>>> Thanks in advance
>>>
>>> Siegfried Goeschl
>>>
>>>
>>> On 25 May 2014, at 04:18, Brian McDowell <br...@gmail.com> wrote:
>>>
>>>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>>>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>>>> will not parse and that's fine, but others are more problematic in that our
>>>> Operations team has to restart Solr because it just hangs and accepts no
>>>> more documents. I actually have identified a pdf that will bring down Solr
>>>> every time. Does anyone think that doing pre-validation using the pdfbox
>>>> jar will work? Or, will trying to validate just hang as well? Any help is
>>>> appreciated.
>>>>
>>>>
>>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <ja...@basetechnology.com> wrote:
>>>>
>>>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> -----Original Message----- From: Siegfried Goeschl
>>>>> Sent: Thursday, May 22, 2014 4:35 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: pdfs
>>>>>
>>>>>
>>>>> Hi folks,
>>>>>
>>>>> for a small customer project I'm running SOLR with embedded Tika.
>>>>>
>>>>> * memory consumption is an issue but can be handled
>>>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>>>> excessive CPU usage - it requires a SOLR restart, but it happens only
>>>>> once within 400,000 documents (PDF, Word, etc.) and seems a little bit
>>>>> erratic, since I was never able to track the problem back to a
>>>>> particular PDF document
>>>>>
>>>>> Having said that, we wire SOLR with Nagios to get an alarm when CPU
>>>>> consumption goes through the roof
>>>>>
>>>>> If you are doing really serious stuff I would recommend
>>>>> * moving the document extraction stuff out of SOLR
>>>>> * providing monitoring and recovery of stuck document extractions
>>>>> ** killing worker threads
>>>>> ** using external processes and killing them when they spin out of control
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Siegfried Goeschl
>>>>>
>>>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>>>
>>>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>>>> has improved over the years, but still not likely to be perfect.
>>>>>>
>>>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>>>> bring down Solr - as opposed to causing a 500 status with an exception in
>>>>>> PDF extraction - other than memory usage. Some PDF documents, especially
>>>>>> those which are graphics-intensive, can require a lot of memory. The rest
>>>>>> of Solr could be adversely affected if all available JVM heap is consumed.
>>>>>> The solution is to give the JVM more heap space.
>>>>>>
>>>>>> So, what is your specific symptom?
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> -----Original Message----- From: Brian McDowell
>>>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: pdfs
>>>>>>
>>>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>>>>> Solr completely so that it actually needs to be manually restarted. We are
>>>>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>>>> problem because the release notes associated with the new tika version and
>>>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>>>> would be greatly appreciated. Thank you!
>>>>>>
>>>>>
>>>>>
>>>
>>


Re: pdfs

Posted by Erick Erickson <er...@gmail.com>.
Brian:

Yeah, if you can share the PDF that would be great. Parsing via Tika should
not bring down Solr, although I suppose there could be something in Tika
that is pathologically bad.

You could also try using Tika itself in SolrJ and indexing from a client. That
might let you
1> more gracefully handle this without shutting down Solr
2> use different versions of Tika.

Personally I like offloading the document parsing to clients anyway since it
lessens the load on the Solr server and scales much better, but YMMV.

It's not actually very difficult, here's a skeleton (rip out the DB parts)
http://searchhub.org/2012/02/14/indexing-with-solrj/

Best,
Erick
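
A minimal sketch of that client-side approach (this is not the code from the
linked skeleton; the Solr URL, core name, and field names are assumptions, and
it targets the Tika facade plus SolrJ 4.x):

    import java.io.File;

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class ClientSideIndexer {

        public static void main(String[] args) throws Exception {
            // Parse locally with Tika so a bad PDF can only hurt this client, not Solr
            Tika tika = new Tika();
            // SolrJ 4.x client; URL and core name are placeholders
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

            for (String path : args) {
                File pdf = new File(path);
                String text;
                try {
                    text = tika.parseToString(pdf);
                } catch (Exception e) {
                    // A document that fails to parse is logged and skipped instead
                    // of taking the whole indexing pipeline down
                    System.err.println("Skipping " + path + ": " + e);
                    continue;
                }
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", pdf.getName());   // field names are assumptions
                doc.addField("content", text);
                solr.add(doc);
            }
            solr.commit();
            solr.shutdown();
        }
    }

This still blocks if Tika itself spins on a bad document, but only the client
process is stuck; Solr keeps serving queries.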

On Sun, May 25, 2014 at 2:07 AM, Siegfried Goeschl <sg...@gmx.at> wrote:
> Sorry, typo - can you send me the PDF by email directly? :-)
>
> Siegfried Goeschl
>
> On 25 May 2014, at 10:06, Siegfried Goeschl <sg...@gmx.at> wrote:
>
>> Hi Brian,
>>
>> can you send me the email? I would like to play around :-)
>>
>> Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce the issue …
>>
>> Thanks in advance
>>
>> Siegfried Goeschl
>>
>>
>> On 25 May 2014, at 04:18, Brian McDowell <br...@gmail.com> wrote:
>>
>>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>>> will not parse and that's fine, but others are more problematic in that our
>>> Operations team has to restart Solr because it just hangs and accepts no
>>> more documents. I actually have identified a pdf that will bring down Solr
>>> every time. Does anyone think that doing pre-validation using the pdfbox
>>> jar will work? Or, will trying to validate just hang as well? Any help is
>>> appreciated.
>>>
>>>
>>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <ja...@basetechnology.com> wrote:
>>>
>>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message----- From: Siegfried Goeschl
>>>> Sent: Thursday, May 22, 2014 4:35 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: pdfs
>>>>
>>>>
>>>> Hi folks,
>>>>
>>>> for a small customer project I'm running SOLR with embedded Tika.
>>>>
>>>> * memory consumption is an issue but can be handled
>>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>>> excessive CPU usage - it requires a SOLR restart, but it happens only
>>>> once within 400,000 documents (PDF, Word, etc.) and seems a little bit
>>>> erratic, since I was never able to track the problem back to a
>>>> particular PDF document
>>>>
>>>> Having said that, we wire SOLR with Nagios to get an alarm when CPU
>>>> consumption goes through the roof
>>>>
>>>> If you are doing really serious stuff I would recommend
>>>> * moving the document extraction stuff out of SOLR
>>>> * providing monitoring and recovery of stuck document extractions
>>>> ** killing worker threads
>>>> ** using external processes and killing them when they spin out of control
>>>>
>>>> Cheers,
>>>>
>>>> Siegfried Goeschl
>>>>
>>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>>>
>>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>>> has improved over the years, but still not likely to be perfect.
>>>>>
>>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>>> bring down Solr - as opposed to causing a 500 status with an exception in
>>>>> PDF extraction - other than memory usage. Some PDF documents, especially
>>>>> those which are graphics-intensive, can require a lot of memory. The rest
>>>>> of Solr could be adversely affected if all available JVM heap is consumed.
>>>>> The solution is to give the JVM more heap space.
>>>>>
>>>>> So, what is your specific symptom?
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> -----Original Message----- From: Brian McDowell
>>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: pdfs
>>>>>
>>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>>>> Solr completely so that it actually needs to be manually restarted. We are
>>>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>>> problem because the release notes associated with the new tika version and
>>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>>> would be greatly appreciated. Thank you!
>>>>>
>>>>
>>>>
>>
>

Re: pdfs

Posted by Siegfried Goeschl <sg...@gmx.at>.
Sorry, typo - can you send me the PDF by email directly? :-)

Siegfried Goeschl

On 25 May 2014, at 10:06, Siegfried Goeschl <sg...@gmx.at> wrote:

> Hi Brian,
> 
> can you send me the email? I would like to play around :-)
> 
> Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce the issue …
> 
> Thanks in advance
> 
> Siegfried Goeschl
> 
> 
> On 25 May 2014, at 04:18, Brian McDowell <br...@gmail.com> wrote:
> 
>> Our feeding (indexing) tool halts because Solr becomes unresponsive after
>> getting some really bad pdfs. There are levels of pdf "badness." Some just
>> will not parse and that's fine, but others are more problematic in that our
>> Operations team has to restart Solr because it just hangs and accepts no
>> more documents. I actually have identified a pdf that will bring down Solr
>> every time. Does anyone think that doing pre-validation using the pdfbox
>> jar will work? Or, will trying to validate just hang as well? Any help is
>> appreciated.
>> 
>> 
>> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <ja...@basetechnology.com> wrote:
>> 
>>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Siegfried Goeschl
>>> Sent: Thursday, May 22, 2014 4:35 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: pdfs
>>> 
>>> 
>>> Hi folks,
>>> 
>>> for a small customer project I'm running SOLR with embedded Tika.
>>>
>>> * memory consumption is an issue but can be handled
>>> * there is an issue with PDFBox hitting an infinite loop which causes
>>> excessive CPU usage - it requires a SOLR restart, but it happens only
>>> once within 400,000 documents (PDF, Word, etc.) and seems a little bit
>>> erratic, since I was never able to track the problem back to a
>>> particular PDF document
>>>
>>> Having said that, we wire SOLR with Nagios to get an alarm when CPU
>>> consumption goes through the roof
>>>
>>> If you are doing really serious stuff I would recommend
>>> * moving the document extraction stuff out of SOLR
>>> * providing monitoring and recovery of stuck document extractions
>>> ** killing worker threads
>>> ** using external processes and killing them when they spin out of control
>>> 
>>> Cheers,
>>> 
>>> Siegfried Goeschl
>>> 
>>> On 22.05.14 06:46, Jack Krupansky wrote:
>>> 
>>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>>> has improved over the years, but still not likely to be perfect.
>>>> 
>>>> That said, I'm not aware of any specific PDF extraction issue that would
>>>> bring down Solr - as opposed to causing a 500 status with an exception in
>>>> PDF extraction - other than memory usage. Some PDF documents, especially
>>>> those which are graphics-intensive, can require a lot of memory. The rest
>>>> of Solr could be adversely affected if all available JVM heap is consumed.
>>>> The solution is to give the JVM more heap space.
>>>> 
>>>> So, what is your specific symptom?
>>>> 
>>>> -- Jack Krupansky
>>>> 
>>>> -----Original Message----- From: Brian McDowell
>>>> Sent: Thursday, May 22, 2014 12:24 AM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: pdfs
>>>> 
>>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>>> Solr completely so that it actually needs to be manually restarted. We are
>>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>>> problem because the release notes associated with the new tika version and
>>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>>> would be greatly appreciated. Thank you!
>>>> 
>>> 
>>> 
> 


Re: pdfs

Posted by Siegfried Goeschl <sg...@gmx.at>.
Hi Brian,

can you send me the email? I would like to play around :-)

Have you opened a JIRA for PDFBox? If not, I will open one if I can reproduce the issue …

Thanks in advance

Siegfried Goeschl


On 25 May 2014, at 04:18, Brian McDowell <br...@gmail.com> wrote:

> Our feeding (indexing) tool halts because Solr becomes unresponsive after
> getting some really bad pdfs. There are levels of pdf "badness." Some just
> will not parse and that's fine, but others are more problematic in that our
> Operations team has to restart Solr because it just hangs and accepts no
> more documents. I actually have identified a pdf that will bring down Solr
> every time. Does anyone think that doing pre-validation using the pdfbox
> jar will work? Or, will trying to validate just hang as well? Any help is
> appreciated.
> 
> 
> On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <ja...@basetechnology.com> wrote:
> 
>> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
>> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Siegfried Goeschl
>> Sent: Thursday, May 22, 2014 4:35 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: pdfs
>> 
>> 
>> Hi folks,
>> 
>> for a small customer project I'm running SOLR with embedded Tika.
>>
>> * memory consumption is an issue but can be handled
>> * there is an issue with PDFBox hitting an infinite loop which causes
>> excessive CPU usage - it requires a SOLR restart, but it happens only
>> once within 400,000 documents (PDF, Word, etc.) and seems a little bit
>> erratic, since I was never able to track the problem back to a
>> particular PDF document
>>
>> Having said that, we wire SOLR with Nagios to get an alarm when CPU
>> consumption goes through the roof
>>
>> If you are doing really serious stuff I would recommend
>> * moving the document extraction stuff out of SOLR
>> * providing monitoring and recovery of stuck document extractions
>> ** killing worker threads
>> ** using external processes and killing them when they spin out of control
>> 
>> Cheers,
>> 
>> Siegfried Goeschl
>> 
>> On 22.05.14 06:46, Jack Krupansky wrote:
>> 
>>> Yeah, PDF extraction has always been at least somewhat problematic. It
>>> has improved over the years, but still not likely to be perfect.
>>> 
>>> That said, I'm not aware of any specific PDF extraction issue that would
>>> bring down Solr - as opposed to causing a 500 status with an exception in
>>> PDF extraction - other than memory usage. Some PDF documents, especially
>>> those which are graphics-intensive, can require a lot of memory. The rest
>>> of Solr could be adversely affected if all available JVM heap is consumed.
>>> The solution is to give the JVM more heap space.
>>> 
>>> So, what is your specific symptom?
>>> 
>>> -- Jack Krupansky
>>> 
>>> -----Original Message----- From: Brian McDowell
>>> Sent: Thursday, May 22, 2014 12:24 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: pdfs
>>> 
>>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>>> Solr completely so that it actually needs to be manually restarted. We are
>>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>>> problem because the release notes associated with the new tika version and
>>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>>> this issue is causing us to reevaluate using Solr. Any help on this matter
>>> would be greatly appreciated. Thank you!
>>> 
>> 
>> 


Re: pdfs

Posted by Brian McDowell <br...@gmail.com>.
Our feeding (indexing) tool halts because Solr becomes unresponsive after
getting some really bad pdfs. There are levels of pdf "badness." Some just
will not parse and that's fine, but others are more problematic in that our
Operations team has to restart Solr because it just hangs and accepts no
more documents. I actually have identified a pdf that will bring down Solr
every time. Does anyone think that doing pre-validation using the pdfbox
jar will work? Or, will trying to validate just hang as well? Any help is
appreciated.
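
On the pre-validation question: since the reported hang is inside PDFBox
itself, a validation pass can hang exactly the same way, so any pre-check
probably needs a timeout around it. A rough sketch, assuming PDFBox 1.8.x and
a made-up time budget:

    import java.io.File;
    import java.util.concurrent.*;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.util.PDFTextStripper;

    public class PdfCanary {

        private static final ExecutorService POOL = Executors.newSingleThreadExecutor();

        /** Returns true if the PDF parses and extracts within the time limit. */
        public static boolean looksParseable(final File pdf, long timeoutSeconds) {
            Future<Boolean> check = POOL.submit(new Callable<Boolean>() {
                public Boolean call() throws Exception {
                    PDDocument doc = PDDocument.load(pdf);
                    try {
                        new PDFTextStripper().getText(doc); // where the 1.8.3 loop bites
                    } finally {
                        doc.close();
                    }
                    return true;
                }
            });
            try {
                return check.get(timeoutSeconds, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                check.cancel(true); // a tight loop may ignore interruption, see below
                return false;
            } catch (Exception e) {
                return false;       // unparseable PDF: skip it
            }
        }
    }

One caveat: if the loop never checks the thread's interrupt flag, cancel(true)
flags the document as bad but leaves the worker thread spinning, which is why
running the extraction in a separate, killable process (as suggested elsewhere
in this thread) is the more robust option.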


On Thu, May 22, 2014 at 8:47 AM, Jack Krupansky <ja...@basetechnology.com> wrote:

> Yeah, I recall running into infinite loop issues with PDFBox in Solr years
> ago. They keep fixing these issues, but they keep popping up again. Sigh.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Siegfried Goeschl
> Sent: Thursday, May 22, 2014 4:35 AM
> To: solr-user@lucene.apache.org
> Subject: Re: pdfs
>
>
> Hi folks,
>
> for a small customer project I'm running SOLR with embedded Tika.
>
> * memory consumption is an issue but can be handled
> * there is an issue with PDFBox hitting an infinite loop which causes
> excessive CPU usage - it requires a SOLR restart, but it happens only once
> within 400,000 documents (PDF, Word, etc.) and seems a little bit erratic,
> since I was never able to track the problem back to a particular PDF
> document
>
> Having said that, we wire SOLR with Nagios to get an alarm when CPU
> consumption goes through the roof
>
> If you are doing really serious stuff I would recommend
> * moving the document extraction stuff out of SOLR
> * providing monitoring and recovery of stuck document extractions
> ** killing worker threads
> ** using external processes and killing them when they spin out of control
>
> Cheers,
>
> Siegfried Goeschl
>
> On 22.05.14 06:46, Jack Krupansky wrote:
>
>> Yeah, PDF extraction has always been at least somewhat problematic. It
>> has improved over the years, but still not likely to be perfect.
>>
>> That said, I'm not aware of any specific PDF extraction issue that would
>> bring down Solr - as opposed to causing a 500 status with an exception in
>> PDF extraction - other than memory usage. Some PDF documents, especially
>> those which are graphics-intensive, can require a lot of memory. The rest
>> of Solr could be adversely affected if all available JVM heap is consumed.
>> The solution is to give the JVM more heap space.
>>
>> So, what is your specific symptom?
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Brian McDowell
>> Sent: Thursday, May 22, 2014 12:24 AM
>> To: solr-user@lucene.apache.org
>> Subject: pdfs
>>
>> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
>> Solr completely so that it actually needs to be manually restarted. We are
>> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
>> problem because the release notes associated with the new tika version and
>> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
>> this issue is causing us to reevaluate using Solr. Any help on this matter
>> would be greatly appreciated. Thank you!
>>
>
>

Re: pdfs

Posted by Jack Krupansky <ja...@basetechnology.com>.
Yeah, I recall running into infinite loop issues with PDFBox in Solr years 
ago. They keep fixing these issues, but they keep popping up again. Sigh.

-- Jack Krupansky

-----Original Message----- 
From: Siegfried Goeschl
Sent: Thursday, May 22, 2014 4:35 AM
To: solr-user@lucene.apache.org
Subject: Re: pdfs

Hi folks,

for a small customer project I'm running SOLR with embedded Tika.

* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes
excessive CPU usage - it requires a SOLR restart, but it happens only once
within 400,000 documents (PDF, Word, etc.) and seems a little bit erratic,
since I was never able to track the problem back to a particular PDF document

Having said that, we wire SOLR with Nagios to get an alarm when CPU
consumption goes through the roof

If you are doing really serious stuff I would recommend
* moving the document extraction stuff out of SOLR
* providing monitoring and recovery of stuck document extractions
** killing worker threads
** using external processes and killing them when they spin out of control

Cheers,

Siegfried Goeschl

On 22.05.14 06:46, Jack Krupansky wrote:
> Yeah, PDF extraction has always been at least somewhat problematic. It
> has improved over the years, but still not likely to be perfect.
>
> That said, I'm not aware of any specific PDF extraction issue that would
> bring down Solr - as opposed to causing a 500 status with an exception in
> PDF extraction - other than memory usage. Some PDF documents, especially
> those which are graphics-intensive, can require a lot of memory. The rest
> of Solr could be adversely affected if all available JVM heap is consumed.
> The solution is to give the JVM more heap space.
>
> So, what is your specific symptom?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Brian McDowell
> Sent: Thursday, May 22, 2014 12:24 AM
> To: solr-user@lucene.apache.org
> Subject: pdfs
>
> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
> Solr completely so that it actually needs to be manually restarted. We are
> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
> problem because the release notes associated with the new tika version and
> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
> this issue is causing us to reevaluate using Solr. Any help on this matter
> would be greatly appreciated. Thank you! 


Re: pdfs

Posted by Siegfried Goeschl <sg...@gmx.at>.
Hi folks,

for a small customer project I'm running SOLR with embedded Tika.

* memory consumption is an issue but can be handled
* there is an issue with PDFBox hitting an infinite loop which causes
excessive CPU usage - it requires a SOLR restart, but it happens only once
within 400,000 documents (PDF, Word, etc.) and seems a little bit erratic,
since I was never able to track the problem back to a particular PDF document

Having said that, we wire SOLR with Nagios to get an alarm when CPU
consumption goes through the roof

If you are doing really serious stuff I would recommend
* moving the document extraction stuff out of SOLR
* providing monitoring and recovery of stuck document extractions
** killing worker threads
** using external processes and killing them when they spin out of control

Cheers,

Siegfried Goeschl
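
A sketch of the external-process variant of that last recommendation, using
the Tika command-line app so a runaway extraction can simply be killed with
its process (the jar name, timeout, and Java 8 process APIs are assumptions):

    import java.io.File;
    import java.util.concurrent.TimeUnit;

    public class ExternalExtractor {

        /** Extracts text via "java -jar tika-app.jar --text", killing it on timeout. */
        public static boolean extract(File pdf, File textOut, long timeoutSeconds)
                throws Exception {
            Process p = new ProcessBuilder(
                    "java", "-jar", "tika-app-1.5.jar", "--text", pdf.getAbsolutePath())
                    .redirectOutput(textOut)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                p.destroyForcibly();   // the spinning extraction dies with the process
                return false;
            }
            return p.exitValue() == 0;
        }
    }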

On 22.05.14 06:46, Jack Krupansky wrote:
> Yeah, PDF extraction has always been at least somewhat problematic. It
> has improved over the years, but still not likely to be perfect.
>
> That said, I'm not aware of any specific PDF extraction issue that would
> bring down Solr - as opposed to causing a 500 status with an exception in
> PDF extraction - other than memory usage. Some PDF documents, especially
> those which are graphics-intensive, can require a lot of memory. The rest
> of Solr could be adversely affected if all available JVM heap is consumed.
> The solution is to give the JVM more heap space.
>
> So, what is your specific symptom?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Brian McDowell
> Sent: Thursday, May 22, 2014 12:24 AM
> To: solr-user@lucene.apache.org
> Subject: pdfs
>
> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
> Solr completely so that it actually needs to be manually restarted. We are
> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
> problem because the release notes associated with the new tika version and
> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
> this issue is causing us to reevaluate using Solr. Any help on this matter
> would be greatly appreciated. Thank you!


Re: pdfs

Posted by Jack Krupansky <ja...@basetechnology.com>.
Yeah, PDF extraction has always been at least somewhat problematic. It has 
improved over the years, but still not likely to be perfect.

That said, I'm not aware of any specific PDF extraction issue that would
bring down Solr - as opposed to causing a 500 status with an exception in
PDF extraction - other than memory usage. Some PDF documents, especially
those which are graphics-intensive, can require a lot of memory. The rest
of Solr could be adversely affected if all available JVM heap is consumed.
The solution is to give the JVM more heap space.

So, what is your specific symptom?

-- Jack Krupansky
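
For the heap-space point, on a stock Solr 4.x example install that just means
starting Jetty with larger heap settings; the sizes here are illustrative only:

    cd example
    java -Xms512m -Xmx4g -jar start.jar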

-----Original Message----- 
From: Brian McDowell
Sent: Thursday, May 22, 2014 12:24 AM
To: solr-user@lucene.apache.org
Subject: pdfs

Has anyone had issues with indexing pdf files? Some pdfs are bringing down
Solr completely so that it actually needs to be manually restarted. We are
using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
problem because the release notes associated with the new tika version and
also the new pdfbox indicate fixes for pdf issues. It didn't work and now
this issue is causing us to reevaluate using Solr. Any help on this matter
would be greatly appreciated. Thank you! 


Re: pdfs

Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Run Tika in a client instead? Or as a standalone server listening over a
TCP socket. Ship only the extractions to Solr. This is more efficient as
well.

I suspect there will always be PDFs that cause strange behaviour, even if
just based on memory requirements (e.g. embedded images). If that becomes
a real issue, move that portion out of the critical path (the Solr
server).

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
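
The standalone-server option mentioned above maps onto the tika-server JAX-RS
app; a minimal sketch (the version number is an example, and 9998 is just the
default port):

    # start the Tika server
    java -jar tika-server-1.5.jar

    # PUT a PDF at it and get plain text back; only this text goes on to Solr
    curl -T document.pdf http://localhost:9998/tika --header "Accept: text/plain"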


On Thu, May 22, 2014 at 11:24 AM, Brian McDowell <br...@gmail.com> wrote:
> Has anyone had issues with indexing pdf files? Some pdfs are bringing down
> Solr completely so that it actually needs to be manually restarted. We are
> using Solr 4.4 and thought that upgrading to Solr 4.8 would solve the
> problem because the release notes associated with the new tika version and
> also the new pdfbox indicate fixes for pdf issues. It didn't work and now
> this issue is causing us to reevaluate using Solr. Any help on this matter
> would be greatly appreciated. Thank you!