Posted to dev@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/07/07 18:59:22 UTC
migrating Tika to 2.0.0
All,
As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika. I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging. The parse does eventually stop, and content is extracted.
I also tried this file outside of Tika with the plain PDFBox-app (both ExtractImages and ExtractText), and performance there is also far slower than with 1.8.9.
Many apologies if this issue has already been identified.
I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9). Is this expected?
Thank you!
Best,
Tim
[0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
Re: migrating Tika to 2.0.0
Posted by Tilman Hausherr <TH...@t-online.de>.
On 08.07.2015 at 04:19, Allison, Timothy B. wrote:
> There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
>
> In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks). With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.
You need to add jai_imageio.jar to your build, and also the levigo jbig2 plugin, just as with the 1.8 version.
https://pdfbox.apache.org/1.8/dependencies.html
If it still doesn't work, could you please post the log message?
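A quick way to see whether the JVM can find a TIFF plugin at all is to ask Image I/O directly. This is only a rough sketch: the class name ImageIoFormatCheck is made up for illustration, and the format name "jbig2" is an assumption about how the levigo plugin registers itself. Java 9 and later ship a TIFF plugin in the JDK, so on those versions "tiff" should show up even without jai_imageio.jar.

```java
import javax.imageio.ImageIO;

// Hypothetical helper: reports which Image I/O plugins the running JVM can see.
class ImageIoFormatCheck {

    // True if at least one ImageReader is registered for the given format name.
    static boolean canRead(String formatName) {
        return ImageIO.getImageReadersByFormatName(formatName).hasNext();
    }

    // True if at least one ImageWriter is registered for the given format name.
    static boolean canWrite(String formatName) {
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }

    public static void main(String[] args) {
        // Pick up plugins that are on the application class path.
        ImageIO.scanForPlugins();
        // "jbig2" is an assumed format name for the levigo plugin.
        for (String format : new String[] {"tiff", "jpeg", "jbig2"}) {
            System.out.printf("%-6s read=%b write=%b%n",
                    format, canRead(format), canWrite(format));
        }
    }
}
```

Run it with the same class path as the failing build. If "tiff" reports write=false, the missing plugin is the likely cause of the empty tiff file, although this check alone does not prove that ExtractImages will succeed.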
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org
RE: migrating Tika to 2.0.0
Posted by "Allison, Timothy B." <ta...@mitre.org>.
There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks). With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.
Re: migrating Tika to 2.0.0
Posted by Tilman Hausherr <TH...@t-online.de>.
On 07.07.2015 at 21:39, Allison, Timothy B. wrote:
> Thank you, Andreas. I opened PDFBox-2856.
>
> How about tiffs not being handled by ExtractImages...is this expected?
>> I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9). Is this expected?
What tiff? When displaying it with Adobe Reader, I see a word file and a
joboptions file.
Tilman
RE: migrating Tika to 2.0.0
Posted by "Allison, Timothy B." <ta...@mitre.org>.
Thank you, Andreas. I opened PDFBox-2856.
How about tiffs not being handled by ExtractImages...is this expected?
> I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9). Is this expected?
Thank you, again.
Best,
Tim
Re: migrating Tika to 2.0.0
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,
On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
> All,
>
> As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika. I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging. The parse does eventually stop, and content is extracted.
What version of PDFBox are you using? I guess the latest SNAPSHOT?
> I also tried this file outside of Tika and used the straight PDFBox-app (both ExtractImages and ExtractText), and performance is also far, far slower when compared with 1.8.9.
I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
1.8.9 when extracting the text from the given pdf.
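For anyone who wants to repeat the comparison, a small timing harness along these lines may help. It is a sketch and not the test I actually ran: ExtractionTimer and medianMillis are hypothetical names, and the workload in main is a stand-in that would have to be replaced by the real text extraction call. The idea would be to run it once with each PDFBox version on the class path and compare the medians.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;

// Hypothetical harness: times a task several times and reports the median.
class ExtractionTimer {

    // Runs the task `runs` times and returns the median wall-clock time in milliseconds.
    static long medianMillis(Callable<?> task, int runs) {
        if (runs < 1) {
            throw new IllegalArgumentException("runs must be at least 1");
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            try {
                task.call();
            } catch (Exception e) {
                throw new IllegalStateException("timed task failed", e);
            }
            samples[i] = (System.nanoTime() - start) / 1_000_000;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload; replace the body with the actual extraction call.
        Callable<String> extract = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
            return sb.toString();
        };
        System.out.println("median ms: " + medianMillis(extract, 5));
    }
}
```

The median is used instead of the mean so that a single slow run, for example the first one before JIT warm-up, does not dominate the result.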
> Many apologies if this issue has already been identified.
AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
reporting.
BR
Andreas