
Posted to dev@pdfbox.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2015/07/09 16:42:17 UTC

RE: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

Hi Tilman,
  Thank you for your quick response.  I'm not sure this is an issue of dependencies (although thank you for that reminder!).

  If I download a prebuilt pdfbox-app-1.8.9 and call ExtractImages, I get two image files, one jpg and one tiff...both are actual image files.  Nothing is logged to stdout.

  If I download a prebuilt nightly build of pdfbox-app-2.0.0 and call ExtractImages, I get one actual image file for the jpeg but then an empty (zero byte) tiff file.

  The log message from 2.0.0 is:

>java -jar pdfbox-app-2.0.0-20150709.140349-1486.jar ExtractImages testPDF_childAttachments.pdf
Writing image: testPDF_childAttachments-1
Writing image: testPDF_childAttachments-2
Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
SEVERE: No ImageWriter found for 'tiff' format
Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
SEVERE: Supported formats: JPG jpg bmp BMP gif GIF WBMP png PNG wbmp jpeg JPEG
Jul 09, 2015 10:37:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
WARNING: No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
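
The "No ImageWriter found" message above can be checked outside PDFBox by asking ImageIO directly which writers are registered on the running JVM. This is only a minimal sketch using javax.imageio; the class name ImageWriterCheck and its helper methods are made up for illustration and are not part of PDFBox. Note that Java 9 and later ship their own TIFF plugin, so the answer depends on the JVM as well as on the classpath.

```java
import javax.imageio.ImageIO;
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical helper, not part of PDFBox: reports which ImageIO writers are registered.
public class ImageWriterCheck {

    // True if at least one ImageWriter is registered for the given format name.
    static boolean hasWriter(String formatName) {
        return ImageIO.getImageWritersByFormatName(formatName).hasNext();
    }

    // Sorted set of every format name that ImageIO can currently write.
    static TreeSet<String> writerFormats() {
        return new TreeSet<>(Arrays.asList(ImageIO.getWriterFormatNames()));
    }

    public static void main(String[] args) {
        System.out.println("tiff writer available: " + hasWriter("tiff"));
        System.out.println("Supported formats: " + writerFormats());
    }
}
```

Running this with the same JVM and classpath as the pdfbox-app jar should show whether a TIFF writer is visible to ImageIO at all.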


-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Wednesday, July 08, 2015 2:59 AM
To: dev@pdfbox.apache.org
Subject: Re: migrating Tika to 2.0.0

On 08.07.2015 at 04:19, Allison, Timothy B. wrote:
> There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
>
> In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks).  With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.

You need to attach jai_imageio.jar to your build.

And also the levigo jbig2 plugin. Like in the 1.8 version.

https://pdfbox.apache.org/1.8/dependencies.html

If it still doesn't work, could you please post the log message?

Tilman

>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Tuesday, July 07, 2015 3:48 PM
> To: dev@pdfbox.apache.org
> Subject: Re: migrating Tika to 2.0.0
>
> On 07.07.2015 at 21:39, Allison, Timothy B. wrote:
>> Thank you, Andreas.  I opened PDFBox-2856.
>>
>> How about tiffs not being handled by ExtractImages...is this expected?
>>>      I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
> What tiff? When displaying it with Adobe Reader, I see a word file and a
> joboptions file.
>
> Tilman
>
>> Thank you, again.
>>
>> Best,
>>
>>             Tim
>> -----Original Message-----
>> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
>> Sent: Tuesday, July 07, 2015 3:08 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: migrating Tika to 2.0.0
>>
>> Hi,
>>
>> On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
>>> All,
>>>
>>>      As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging.  The parse does eventually stop, and content is extracted.
>> What version of PDFBox are you using? I guess the latest SNAPSHOT?
>>
>>>      I also tried this file outside of Tika and used the straight PDFBox-app ( both ExtractImages and ExtractText), and performance is also far, far slower when compared with 1.8.9.
>> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
>> 1.8.9 when extracting the text from the given pdf.
>>
>>>      Many apologies if this issue has already been identified.
>> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
>> reporting.
>>
>>>      I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>
>>>             Thank you!
>>>
>>>                  Best,
>>>
>>>                         Tim
>>> [0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
>>> For additional commands, e-mail: dev-help@pdfbox.apache.org
>>>
>> BR
>> Andreas
>>

RE: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Ha. Ok, got it.  We'll make extra sure to include the "take a look at potential non-Apache supporting libraries" in our release notes.

Thank you!

-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Thursday, July 09, 2015 12:51 PM
To: dev@pdfbox.apache.org
Subject: Re: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

On 09.07.2015 at 18:35, Allison, Timothy B. wrote:
>> Are the tiff files generated with 1.8.9 valid and really tiff files?
>> I.e. non empty, and start with "II" or "MM"?
> The file extracted with 1.8.9 app is non-empty, starts with II and when double clicked in windows opens up to show an image of a lightbulb.
> The file extracted with 2.0.0 app is empty.
>
>> Could it be that jai_imageio is packaged somewhere else in your classpath?
>
> Not sure how jai-imageio would sneak onto my classpath with this:
>
> java -jar pdfbox-app-1.8.9.jar ExtractImages testPDF_childAttachments.pdf
>
> but not with this:
>
> java -jar pdfbox-app-2.0.0-20150707.080851-1479.jar ExtractImages testPDF_childAttachments.pdf
Sorry, I didn't read your text properly. And I have a good answer too 
this time.

In 1.8.9 tiff files were written with write2OutputStream, which uses 
TiffWrapper, which creates a TIFF file itself. This was deleted in
https://issues.apache.org/jira/browse/PDFBOX-2653
https://svn.apache.org/viewvc?view=revision&revision=r1660716
(I proposed the deletion)

Tilman




>
>
> Could it be that jai_imageio is packaged somewhere else in your classpath?
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Thursday, July 09, 2015 12:24 PM
> To: dev@pdfbox.apache.org
> Subject: Re: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages
>
> On 09.07.2015 at 18:11, Allison, Timothy B. wrote:
>>> That error comes when jai_imageio.jar is missing. Standard java can't
>>> write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff,
>>> because we can't bundle jai_imageio.jar due to license issues.
>> Ok, but to confirm, it isn't the case that 1.8.9 app has jai_imageio.jar bundled but 2.0.0 doesn't?
>>
>> Prebuilt-1.8.9 app _does_ write to tiff, but prebuilt-2.0.0 app doesn't.
>>
>> What's the difference?
> Are the tiff files generated with 1.8.9 valid and really tiff files?
> I.e. non empty, and start with "II" or "MM"?
>
> Could it be that jai_imageio is packaged somewhere else in your classpath?
>
> Tilman
>
>
>>
>> On 09.07.2015 at 16:42, Allison, Timothy B. wrote:
>>> Hi Tilman,
>>>      Thank you for your quick response.  I'm not sure this is an issue of dependencies (although thank you for that reminder!).
>>>
>>>      If I download a prebuilt pdfbox-app-1.8.9 and call ExtractImages, I get two image files, one jpg and one tiff...both are actual image files.  Nothing is logged to stdout.
>>>
>>>      If I download a prebuilt nightly build of pdfbox-app-2.0.0 and call ExtractImages, I get one actual image file for the jpeg but then an empty (zero byte) tiff file.
>>>
>>>      The log message from 2.0.0 is:
>>>
>>> >java -jar pdfbox-app-2.0.0-20150709.140349-1486.jar ExtractImages testPDF_childAttachments.pdf
>>> Writing image: testPDF_childAttachments-1
>>> Writing image: testPDF_childAttachments-2
>>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
>>> SEVERE: No ImageWriter found for 'tiff' format
>>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
>>> SEVERE: Supported formats: JPG jpg bmp BMP gif GIF WBMP png PNG wbmp jpeg JPEG
>> That error comes when jai_imageio.jar is missing. Standard java can't
>> write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff,
>> because we can't bundle jai_imageio.jar due to license issues.
>>
>> Tilman
>>
>>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
>>> WARNING: No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
>>>
>>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>> Sent: Wednesday, July 08, 2015 2:59 AM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: migrating Tika to 2.0.0
>>>
>>> On 08.07.2015 at 04:19, Allison, Timothy B. wrote:
>>>> There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
>>>>
>>>> In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks).  With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.
>>> You need to attach jai_imageio.jar to your build.
>>>
>>> And also the levigo jbig2 plugin. Like in the 1.8 version.
>>>
>>> https://pdfbox.apache.org/1.8/dependencies.html
>>>
>>> If it still doesn't work, could you please post the log message?
>>>
>>> Tilman
>>>
>>>> -----Original Message-----
>>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>>> Sent: Tuesday, July 07, 2015 3:48 PM
>>>> To: dev@pdfbox.apache.org
>>>> Subject: Re: migrating Tika to 2.0.0
>>>>
>>>> On 07.07.2015 at 21:39, Allison, Timothy B. wrote:
>>>>> Thank you, Andreas.  I opened PDFBox-2856.
>>>>>
>>>>> How about tiffs not being handled by ExtractImages...is this expected?
>>>>>>         I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>> What tiff? When displaying it with Adobe Reader, I see a word file and a
>>>> joboptions file.
>>>>
>>>> Tilman
>>>>
>>>>> Thank you, again.
>>>>>
>>>>> Best,
>>>>>
>>>>>                Tim
>>>>> -----Original Message-----
>>>>> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
>>>>> Sent: Tuesday, July 07, 2015 3:08 PM
>>>>> To: dev@pdfbox.apache.org
>>>>> Subject: Re: migrating Tika to 2.0.0
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
>>>>>> All,
>>>>>>
>>>>>>         As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging.  The parse does eventually stop, and content is extracted.
>>>>> What version of PDFBox are you using? I guess the latest SNAPSHOT?
>>>>>
>>>>>>         I also tried this file outside of Tika and used the straight PDFBox-app ( both ExtractImages and ExtractText), and performance is also far, far slower when compared with 1.8.9.
>>>>> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
>>>>> 1.8.9 when extracting the text from the given pdf.
>>>>>
>>>>>>         Many apologies if this issue has already been identified.
>>>>> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
>>>>> reporting.
>>>>>
>>>>>>         I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>>>>
>>>>>>                Thank you!
>>>>>>
>>>>>>                     Best,
>>>>>>
>>>>>>                            Tim
>>>>>> [0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> BR
>>>>> Andreas
>>>>>

Re: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

Posted by Tilman Hausherr <TH...@t-online.de>.

On 09.07.2015 at 18:35, Allison, Timothy B. wrote:
>> Are the tiff files generated with 1.8.9 valid and really tiff files?
>> I.e. non empty, and start with "II" or "MM"?
> The file extracted with 1.8.9 app is non-empty, starts with II and when double clicked in windows opens up to show an image of a lightbulb.
> The file extracted with 2.0.0 app is empty.
>
>> Could it be that jai_imageio is packaged somewhere else in your classpath?
>
> Not sure how jai-imageio would sneak onto my classpath with this:
>
> java -jar pdfbox-app-1.8.9.jar ExtractImages testPDF_childAttachments.pdf
>
> but not with this:
>
> java -jar pdfbox-app-2.0.0-20150707.080851-1479.jar ExtractImages testPDF_childAttachments.pdf
Sorry, I didn't read your text properly. And I have a good answer too 
this time.

In 1.8.9 tiff files were written with write2OutputStream, which uses 
TiffWrapper, which creates a TIFF file itself. This was deleted in
https://issues.apache.org/jira/browse/PDFBOX-2653
https://svn.apache.org/viewvc?view=revision&revision=r1660716
(I proposed the deletion)

Tilman
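
With TiffWrapper gone in 2.0.0, a caller that cannot add a TIFF plugin might avoid the zero-byte file by checking for a writer first and falling back to a format from the "Supported formats" list in the log. The following is a rough sketch using only javax.imageio, not PDFBox's own ImageIOUtil. The class and method names (FallbackImageWriter, writeWithFallback) are invented for illustration, and choosing PNG as the fallback is an assumption.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper, not part of PDFBox.
public class FallbackImageWriter {

    // Writes the image in the preferred format if a writer is registered,
    // otherwise as PNG. The output file is only created after a writer is known.
    static File writeWithFallback(BufferedImage image, File dir,
                                  String baseName, String preferredFormat) {
        boolean available = ImageIO.getImageWritersByFormatName(preferredFormat).hasNext();
        String format = available ? preferredFormat : "png";
        File out = new File(dir, baseName + "." + format);
        try {
            // ImageIO.write returns false when no writer can encode this image.
            if (!ImageIO.write(image, format, out)) {
                throw new IOException("No ImageWriter could encode the image as " + format);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }
}
```

A call such as writeWithFallback(image, outputDir, "testPDF_childAttachments-2", "tiff") would then produce either a .tiff or a .png file, depending on what the JVM can write.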




>
>
> Could it be that jai_imageio is packaged somewhere else in your classpath?
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Thursday, July 09, 2015 12:24 PM
> To: dev@pdfbox.apache.org
> Subject: Re: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages
>
> On 09.07.2015 at 18:11, Allison, Timothy B. wrote:
>>> That error comes when jai_imageio.jar is missing. Standard java can't
>>> write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff,
>>> because we can't bundle jai_imageio.jar due to license issues.
>> Ok, but to confirm, it isn't the case that 1.8.9 app has jai_imageio.jar bundled but 2.0.0 doesn't?
>>
>> Prebuilt-1.8.9 app _does_ write to tiff, but prebuilt-2.0.0 app doesn't.
>>
>> What's the difference?
> Are the tiff files generated with 1.8.9 valid and really tiff files?
> I.e. non empty, and start with "II" or "MM"?
>
> Could it be that jai_imageio is packaged somewhere else in your classpath?
>
> Tilman
>
>
>>
>> On 09.07.2015 at 16:42, Allison, Timothy B. wrote:
>>> Hi Tilman,
>>>      Thank you for your quick response.  I'm not sure this is an issue of dependencies (although thank you for that reminder!).
>>>
>>>      If I download a prebuilt pdfbox-app-1.8.9 and call ExtractImages, I get two image files, one jpg and one tiff...both are actual image files.  Nothing is logged to stdout.
>>>
>>>      If I download a prebuilt nightly build of pdfbox-app-2.0.0 and call ExtractImages, I get one actual image file for the jpeg but then an empty (zero byte) tiff file.
>>>
>>>      The log message from 2.0.0 is:
>>>
>>> >java -jar pdfbox-app-2.0.0-20150709.140349-1486.jar ExtractImages testPDF_childAttachments.pdf
>>> Writing image: testPDF_childAttachments-1
>>> Writing image: testPDF_childAttachments-2
>>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
>>> SEVERE: No ImageWriter found for 'tiff' format
>>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
>>> SEVERE: Supported formats: JPG jpg bmp BMP gif GIF WBMP png PNG wbmp jpeg JPEG
>> That error comes when jai_imageio.jar is missing. Standard java can't
>> write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff,
>> because we can't bundle jai_imageio.jar due to license issues.
>>
>> Tilman
>>
>>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
>>> WARNING: No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
>>>
>>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>> Sent: Wednesday, July 08, 2015 2:59 AM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: migrating Tika to 2.0.0
>>>
>>> On 08.07.2015 at 04:19, Allison, Timothy B. wrote:
>>>> There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
>>>>
>>>> In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks).  With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.
>>> You need to attach jai_imageio.jar to your build.
>>>
>>> And also the levigo jbig2 plugin. Like in the 1.8 version.
>>>
>>> https://pdfbox.apache.org/1.8/dependencies.html
>>>
>>> If it still doesn't work, could you please post the log message?
>>>
>>> Tilman
>>>
>>>> -----Original Message-----
>>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>>> Sent: Tuesday, July 07, 2015 3:48 PM
>>>> To: dev@pdfbox.apache.org
>>>> Subject: Re: migrating Tika to 2.0.0
>>>>
>>>> On 07.07.2015 at 21:39, Allison, Timothy B. wrote:
>>>>> Thank you, Andreas.  I opened PDFBox-2856.
>>>>>
>>>>> How about tiffs not being handled by ExtractImages...is this expected?
>>>>>>         I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>> What tiff? When displaying it with Adobe Reader, I see a word file and a
>>>> joboptions file.
>>>>
>>>> Tilman
>>>>
>>>>> Thank you, again.
>>>>>
>>>>> Best,
>>>>>
>>>>>                Tim
>>>>> -----Original Message-----
>>>>> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
>>>>> Sent: Tuesday, July 07, 2015 3:08 PM
>>>>> To: dev@pdfbox.apache.org
>>>>> Subject: Re: migrating Tika to 2.0.0
>>>>>
>>>>> Hi,
>>>>>
>>>>> On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
>>>>>> All,
>>>>>>
>>>>>>         As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging.  The parse does eventually stop, and content is extracted.
>>>>> What version of PDFBox are you using? I guess the latest SNAPSHOT?
>>>>>
>>>>>>         I also tried this file outside of Tika and used the straight PDFBox-app ( both ExtractImages and ExtractText), and performance is also far, far slower when compared with 1.8.9.
>>>>> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
>>>>> 1.8.9 when extracting the text from the given pdf.
>>>>>
>>>>>>         Many apologies if this issue has already been identified.
>>>>> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
>>>>> reporting.
>>>>>
>>>>>>         I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>>>>
>>>>>>                Thank you!
>>>>>>
>>>>>>                     Best,
>>>>>>
>>>>>>                            Tim
>>>>>> [0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>> BR
>>>>> Andreas
>>>>>

RE: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

Posted by "Allison, Timothy B." <ta...@mitre.org>.

>Are the tiff files generated with 1.8.9 valid and really tiff files? 
>I.e. non empty, and start with "II" or "MM"?

The file extracted with 1.8.9 app is non-empty, starts with II and when double clicked in windows opens up to show an image of a lightbulb.
The file extracted with 2.0.0 app is empty.
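
The "II" / "MM" check asked about above can be done without opening the file in a viewer. A minimal sketch, assuming classic (non-BigTIFF) files: a TIFF starts with the byte-order mark "II" (little-endian) or "MM" (big-endian), followed by the number 42 in that byte order. The class name TiffHeaderCheck and its methods are hypothetical.

```java
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper: tests whether a file starts with a classic TIFF header.
public class TiffHeaderCheck {

    // Checks the first four bytes: "II" + 42 little-endian, or "MM" + 42 big-endian.
    static boolean isTiffHeader(byte[] h) {
        if (h == null || h.length < 4) {
            return false;
        }
        boolean littleEndian = h[0] == 'I' && h[1] == 'I' && h[2] == 0x2A && h[3] == 0x00;
        boolean bigEndian = h[0] == 'M' && h[1] == 'M' && h[2] == 0x00 && h[3] == 0x2A;
        return littleEndian || bigEndian;
    }

    // Reads the first four bytes of the file; empty or very short files are rejected.
    static boolean looksLikeTiff(File file) throws IOException {
        if (file.length() < 4) {
            return false;
        }
        byte[] header = new byte[4];
        try (DataInputStream in = new DataInputStream(new FileInputStream(file))) {
            in.readFully(header);
        }
        return isTiffHeader(header);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + looksLikeTiff(new File(name)));
        }
    }
}
```

Under these assumptions, the zero-byte file written by the 2.0.0 app would be reported as false, since it is shorter than the four-byte header.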

> Could it be that jai_imageio is packaged somewhere else in your classpath?


Not sure how jai-imageio would sneak onto my classpath with this:

java -jar pdfbox-app-1.8.9.jar ExtractImages testPDF_childAttachments.pdf

but not with this:

java -jar pdfbox-app-2.0.0-20150707.080851-1479.jar ExtractImages testPDF_childAttachments.pdf




-----Original Message-----
From: Tilman Hausherr [mailto:THausherr@t-online.de] 
Sent: Thursday, July 09, 2015 12:24 PM
To: dev@pdfbox.apache.org
Subject: Re: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

On 09.07.2015 at 18:11, Allison, Timothy B. wrote:
>> That error comes when jai_imageio.jar is missing. Standard java can't
>> write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff,
>> because we can't bundle jai_imageio.jar due to license issues.
> Ok, but to confirm, it isn't the case that 1.8.9 app has jai_imageio.jar bundled but 2.0.0 doesn't?
>
> Prebuilt-1.8.9 app _does_ write to tiff, but prebuilt-2.0.0 app doesn't.
>
> What's the difference?

Are the tiff files generated with 1.8.9 valid and really tiff files? 
I.e. non empty, and start with "II" or "MM"?

Could it be that jai_imageio is packaged somewhere else in your classpath?

Tilman


>
>
> On 09.07.2015 at 16:42, Allison, Timothy B. wrote:
>> Hi Tilman,
>>     Thank you for your quick response.  I'm not sure this is an issue of dependencies (although thank you for that reminder!).
>>
>>     If I download a prebuilt pdfbox-app-1.8.9 and call ExtractImages, I get two image files, one jpg and one tiff...both are actual image files.  Nothing is logged to stdout.
>>
>>     If I download a prebuilt nightly build of pdfbox-app-2.0.0 and call ExtractImages, I get one actual image file for the jpeg but then an empty (zero byte) tiff file.
>>
>>     The log message from 2.0.0 is:
>>
>> >java -jar pdfbox-app-2.0.0-20150709.140349-1486.jar ExtractImages testPDF_childAttachments.pdf
>> Writing image: testPDF_childAttachments-1
>> Writing image: testPDF_childAttachments-2
>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
>> SEVERE: No ImageWriter found for 'tiff' format
>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
>> SEVERE: Supported formats: JPG jpg bmp BMP gif GIF WBMP png PNG wbmp jpeg JPEG
> That error comes when jai_imageio.jar is missing. Standard java can't
> write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff,
> because we can't bundle jai_imageio.jar due to license issues.
>
> Tilman
>
>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
>> WARNING: No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
>>
>>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Wednesday, July 08, 2015 2:59 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: migrating Tika to 2.0.0
>>
>> On 08.07.2015 at 04:19, Allison, Timothy B. wrote:
>>> There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
>>>
>>> In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks).  With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.
>> You need to attach jai_imageio.jar to your build.
>>
>> And also the levigo jbig2 plugin. Like in the 1.8 version.
>>
>> https://pdfbox.apache.org/1.8/dependencies.html
>>
>> If it still doesn't work, could you please post the log message?
>>
>> Tilman
>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>> Sent: Tuesday, July 07, 2015 3:48 PM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: migrating Tika to 2.0.0
>>>
>>> On 07.07.2015 at 21:39, Allison, Timothy B. wrote:
>>>> Thank you, Andreas.  I opened PDFBox-2856.
>>>>
>>>> How about tiffs not being handled by ExtractImages...is this expected?
>>>>>        I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>> What tiff? When displaying it with Adobe Reader, I see a word file and a
>>> joboptions file.
>>>
>>> Tilman
>>>
>>>> Thank you, again.
>>>>
>>>> Best,
>>>>
>>>>               Tim
>>>> -----Original Message-----
>>>> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
>>>> Sent: Tuesday, July 07, 2015 3:08 PM
>>>> To: dev@pdfbox.apache.org
>>>> Subject: Re: migrating Tika to 2.0.0
>>>>
>>>> Hi,
>>>>
>>>> On 07.07.2015 at 18:59, Allison, Timothy B. wrote:
>>>>> All,
>>>>>
>>>>>        As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging.  The parse does eventually stop, and content is extracted.
>>>> What version of PDFBox are you using? I guess the latest SNAPSHOT?
>>>>
>>>>>        I also tried this file outside of Tika and used the straight PDFBox-app ( both ExtractImages and ExtractText), and performance is also far, far slower when compared with 1.8.9.
>>>> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
>>>> 1.8.9 when extracting the text from the given pdf.
>>>>
>>>>>        Many apologies if this issue has already been identified.
>>>> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
>>>> reporting.
>>>>
>>>>>        I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>>>
>>>>>               Thank you!
>>>>>
>>>>>                    Best,
>>>>>
>>>>>                           Tim
>>>>> [0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>>>>
>>>>>
>>>>>
>>>>>
>>>> BR
>>>> Andreas
>>>>

Re: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 09.07.2015 um 18:11 schrieb Allison, Timothy B.:
>> That error comes when jai_imageio.jar is missing. Standard java can't
>> write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff,
>> because we can't bundle jai_imageio.jar due to license issues.
> Ok, but to confirm, it isn't the case that 1.8.9 app has jai_imageio.jar bundled but 2.0.0 doesn't?
>
> Prebuilt-1.8.9 app _does_ write to tiff, but prebuilt-2.0.0 app doesn't.
>
> What's the difference?

Are the tiff files generated with 1.8.9 valid and really tiff files? 
I.e. non empty, and start with "II" or "MM"?
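
The check asked for above can be scripted. The following is a minimal sketch, not part of PDFBox: the class and method names are hypothetical. It relies only on the classic TIFF header layout (byte order mark "II" or "MM", followed by the number 42 in that byte order) and treats a zero-byte file as not being a TIFF.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of PDFBox.
class TiffHeaderCheck {

    // True if the first four bytes are a classic TIFF header:
    // "II" + 42 (little-endian) or "MM" + 42 (big-endian).
    // BigTIFF (magic number 43) is deliberately not accepted here.
    static boolean hasTiffHeader(byte[] head) {
        if (head == null || head.length < 4) {
            return false; // also covers the empty (zero-byte) file case
        }
        boolean littleEndian = head[0] == 'I' && head[1] == 'I' && head[2] == 42 && head[3] == 0;
        boolean bigEndian = head[0] == 'M' && head[1] == 'M' && head[2] == 0 && head[3] == 42;
        return littleEndian || bigEndian;
    }

    // Reads up to four bytes from the file and checks them.
    static boolean looksLikeTiff(Path file) throws IOException {
        byte[] head = new byte[4];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < 4 && (n = in.read(head, total, 4 - total)) > 0) {
                total += n;
            }
        }
        return total == 4 && hasTiffHeader(head);
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            Path p = Paths.get(name);
            System.out.println(name + ": " + Files.size(p) + " bytes, TIFF header: " + looksLikeTiff(p));
        }
    }
}
```

Running it against the files written by both versions of ExtractImages should show quickly whether the 1.8.9 output is a real tiff or only carries the extension.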

Could it be that jai_imageio is packaged somewhere else in your classpath?
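
One way to answer the classpath question is to ask ImageIO directly which writers it has registered for "tiff" and where each one was loaded from. This is a rough sketch using only the standard javax.imageio API; the class name ImageWriterProbe is made up for illustration, and it has to be run with the same JVM and classpath as the pdfbox-app in question to be meaningful.

```java
import java.security.CodeSource;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriter;

// Hypothetical diagnostic for illustration only; not part of PDFBox.
class ImageWriterProbe {

    // Lists every ImageWriter registered for the given format name,
    // together with the jar or directory its class was loaded from.
    static List<String> describeWriters(String formatName) {
        List<String> found = new ArrayList<>();
        Iterator<ImageWriter> writers = ImageIO.getImageWritersByFormatName(formatName);
        while (writers.hasNext()) {
            ImageWriter writer = writers.next();
            Class<?> type = writer.getClass();
            CodeSource source = type.getProtectionDomain().getCodeSource();
            // Classes loaded by the bootstrap loader may have no code source.
            String location = (source == null || source.getLocation() == null)
                    ? "(no code source, probably the JRE itself)"
                    : source.getLocation().toString();
            found.add(type.getName() + " from " + location);
            writer.dispose();
        }
        return found;
    }

    public static void main(String[] args) {
        String format = args.length > 0 ? args[0] : "tiff";
        List<String> writers = describeWriters(format);
        if (writers.isEmpty()) {
            System.out.println("No ImageWriter registered for '" + format + "'");
        } else {
            for (String line : writers) {
                System.out.println(line);
            }
        }
        System.out.println("Supported formats: " + String.join(" ", ImageIO.getWriterFormatNames()));
    }
}
```

If a tiff writer shows up here, the printed location should tell which jar provides it. If none shows up, the result matches the "No ImageWriter found" log message from 2.0.0.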

Tilman


>
>
> Am 09.07.2015 um 16:42 schrieb Allison, Timothy B.:
>> Hi Tilman,
>>     Thank you for your quick response.  I'm not sure this is an issue of dependencies (although thank you for that reminder!).
>>
>>     If I download a prebuilt pdfbox-app-1.8.9 and call ExtractImages, I get two image files, one jpg and one tiff...both are actual image files.  Nothing is logged to stdout.
>>
>>     If I download a prebuilt nightly build of pdfbox-app-2.0.0 and call ExtractImages, I get one actual image file for the jpeg but then an empty (zero byte) tiff file.
>>
>>     The log message from 2.0.0 is:
>>
>> >java -jar pdfbox-app-2.0.0-20150709.140349-1486.jar ExtractImages testPDF_childAttachments.pdf
>> Writing image: testPDF_childAttachments-1
>> Writing image: testPDF_childAttachments-2
>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
>> SEVERE: No ImageWriter found for 'tiff' format
>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
>> SEVERE: Supported formats: JPG jpg bmp BMP gif GIF WBMP png PNG wbmp jpeg JPEG
> That error comes when jai_imageio.jar is missing. Standard java can't
> write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff,
> because we can't bundle jai_imageio.jar due to license issues.
>
> Tilman
>
>> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
>> WARNING: No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
>>
>>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Wednesday, July 08, 2015 2:59 AM
>> To: dev@pdfbox.apache.org
>> Subject: Re: migrating Tika to 2.0.0
>>
>> Am 08.07.2015 um 04:19 schrieb Allison, Timothy B.:
>>> There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
>>>
>>> In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks).  With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.
>> You need to attach jai_imageio.jar to your build.
>>
>> And also the levigo jbig2 plugin. Like in the 1.8 version.
>>
>> https://pdfbox.apache.org/1.8/dependencies.html
>>
>> If it still doesn't work, could you please post the log message?
>>
>> Tilman
>>
>>> -----Original Message-----
>>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>>> Sent: Tuesday, July 07, 2015 3:48 PM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: migrating Tika to 2.0.0
>>>
>>> Am 07.07.2015 um 21:39 schrieb Allison, Timothy B.:
>>>> Thank you, Andreas.  I opened PDFBox-2856.
>>>>
>>>> How about tiffs not being handled by ExtractImages...is this expected?
>>>>>        I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>> What tiff? When displaying it with Adobe Reader, I see a word file and a
>>> joboptions file.
>>>
>>> Tilman
>>>
>>>> Thank you, again.
>>>>
>>>> Best,
>>>>
>>>>               Tim
>>>> -----Original Message-----
>>>> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
>>>> Sent: Tuesday, July 07, 2015 3:08 PM
>>>> To: dev@pdfbox.apache.org
>>>> Subject: Re: migrating Tika to 2.0.0
>>>>
>>>> Hi,
>>>>
>>>> Am 07.07.2015 um 18:59 schrieb Allison, Timothy B.:
>>>>> All,
>>>>>
>>>>>        As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging.  The parse does eventually stop, and content is extracted.
>>>> What version of PDFBox are you using? I guess the latest SNAPSHOT?
>>>>
>>>>>        I also tried this file outside of Tika and used the straight PDFBox-app ( both ExtractImages and ExtractText), and performance is also far, far slower when compared with 1.8.9.
>>>> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
>>>> 1.8.9 when extracting the text from the given pdf.
>>>>
>>>>>        Many apologies if this issue has already been identified.
>>>> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
>>>> reporting.
>>>>
>>>>>        I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>>>
>>>>>               Thank you!
>>>>>
>>>>>                    Best,
>>>>>
>>>>>                           Tim
>>>>> [0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>>>>
>>>>>
>>>>>
>>>> BR
>>>> Andreas


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

RE: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

Posted by "Allison, Timothy B." <ta...@mitre.org>.

>That error comes when jai_imageio.jar is missing. Standard java can't 
>write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff, 
>because we can't bundle jai_imageio.jar due to license issues.

Ok, but to confirm, it isn't the case that 1.8.9 app has jai_imageio.jar bundled but 2.0.0 doesn't?

Prebuilt-1.8.9 app _does_ write to tiff, but prebuilt-2.0.0 app doesn't.

What's the difference?


Am 09.07.2015 um 16:42 schrieb Allison, Timothy B.:
> Hi Tilman,
>    Thank you for your quick response.  I'm not sure this is an issue of dependencies (although thank you for that reminder!).
>
>    If I download a prebuilt pdfbox-app-1.8.9 and call ExtractImages, I get two image files, one jpg and one tiff...both are actual image files.  Nothing is logged to stdout.
>
>    If I download a prebuilt nightly build of pdfbox-app-2.0.0 and call ExtractImages, I get one actual image file for the jpeg but then an empty (zero byte) tiff file.
>
>    The log message from 2.0.0 is:
>
> >java -jar pdfbox-app-2.0.0-20150709.140349-1486.jar ExtractImages testPDF_childAttachments.pdf
> Writing image: testPDF_childAttachments-1
> Writing image: testPDF_childAttachments-2
> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
> SEVERE: No ImageWriter found for 'tiff' format
> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
> SEVERE: Supported formats: JPG jpg bmp BMP gif GIF WBMP png PNG wbmp jpeg JPEG

That error comes when jai_imageio.jar is missing. Standard java can't 
write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff, 
because we can't bundle jai_imageio.jar due to license issues.

Tilman

> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Wednesday, July 08, 2015 2:59 AM
> To: dev@pdfbox.apache.org
> Subject: Re: migrating Tika to 2.0.0
>
> Am 08.07.2015 um 04:19 schrieb Allison, Timothy B.:
>> There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
>>
>> In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks).  With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.
> You need to attach jai_imageio.jar to your build.
>
> And also the levigo jbig2 plugin. Like in the 1.8 version.
>
> https://pdfbox.apache.org/1.8/dependencies.html
>
> If it still doesn't work, could you please post the log message?
>
> Tilman
>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Tuesday, July 07, 2015 3:48 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: migrating Tika to 2.0.0
>>
>> Am 07.07.2015 um 21:39 schrieb Allison, Timothy B.:
>>> Thank you, Andreas.  I opened PDFBox-2856.
>>>
>>> How about tiffs not being handled by ExtractImages...is this expected?
>>>>       I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>> What tiff? When displaying it with Adobe Reader, I see a word file and a
>> joboptions file.
>>
>> Tilman
>>
>>> Thank you, again.
>>>
>>> Best,
>>>
>>>              Tim
>>> -----Original Message-----
>>> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
>>> Sent: Tuesday, July 07, 2015 3:08 PM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: migrating Tika to 2.0.0
>>>
>>> Hi,
>>>
>>> Am 07.07.2015 um 18:59 schrieb Allison, Timothy B.:
>>>> All,
>>>>
>>>>       As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging.  The parse does eventually stop, and content is extracted.
>>> What version of PDFBox are you using? I guess the latest SNAPSHOT?
>>>
>>>>       I also tried this file outside of Tika and used the straight PDFBox-app ( both ExtractImages and ExtractText), and performance is also far, far slower when compared with 1.8.9.
>>> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
>>> 1.8.9 when extracting the text from the given pdf.
>>>
>>>>       Many apologies if this issue has already been identified.
>>> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
>>> reporting.
>>>
>>>>       I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>>
>>>>              Thank you!
>>>>
>>>>                   Best,
>>>>
>>>>                          Tim
>>>> [0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>>>
>>>>
>>>>
>>> BR
>>> Andreas


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: migrating Tika to 2.0.0 -- tiff files in app's ExtractImages

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 09.07.2015 um 16:42 schrieb Allison, Timothy B.:
> Hi Tilman,
>    Thank you for your quick response.  I'm not sure this is an issue of dependencies (although thank you for that reminder!).
>
>    If I download a prebuilt pdfbox-app-1.8.9 and call ExtractImages, I get two image files, one jpg and one tiff...both are actual image files.  Nothing is logged to stdout.
>
>    If I download a prebuilt nightly build of pdfbox-app-2.0.0 and call ExtractImages, I get one actual image file for the jpeg but then an empty (zero byte) tiff file.
>
>    The log message from 2.0.0 is:
>
> >java -jar pdfbox-app-2.0.0-20150709.140349-1486.jar ExtractImages testPDF_childAttachments.pdf
> Writing image: testPDF_childAttachments-1
> Writing image: testPDF_childAttachments-2
> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
> SEVERE: No ImageWriter found for 'tiff' format
> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.tools.imageio.ImageIOUtil writeImage
> SEVERE: Supported formats: JPG jpg bmp BMP gif GIF WBMP png PNG wbmp jpeg JPEG

That error comes when jai_imageio.jar is missing. Standard java can't 
write to tiff. And yes, the prebuilt pdfbox-app can't write to tiff, 
because we can't bundle jai_imageio.jar due to license issues.

Tilman

> Jul 09, 2015 10:37:13 AM org.apache.pdfbox.pdmodel.font.PDSimpleFont toUnicode
> WARNING: No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
>
>
> -----Original Message-----
> From: Tilman Hausherr [mailto:THausherr@t-online.de]
> Sent: Wednesday, July 08, 2015 2:59 AM
> To: dev@pdfbox.apache.org
> Subject: Re: migrating Tika to 2.0.0
>
> Am 08.07.2015 um 04:19 schrieb Allison, Timothy B.:
>> There are two embedded/inline images (not regular attachments) that are processed by pdfbox app's ExtractImages.
>>
>> In 1.8.9, there's a tiff (lightbulb) and a jpeg (flag/fireworks).  With trunk, there is a log warning saying that tiff isn't supported and then an empty tiff file and a jpeg.
> You need to attach jai_imageio.jar to your build.
>
> And also the levigo jbig2 plugin. Like in the 1.8 version.
>
> https://pdfbox.apache.org/1.8/dependencies.html
>
> If it still doesn't work, could you please post the log message?
>
> Tilman
>
>> -----Original Message-----
>> From: Tilman Hausherr [mailto:THausherr@t-online.de]
>> Sent: Tuesday, July 07, 2015 3:48 PM
>> To: dev@pdfbox.apache.org
>> Subject: Re: migrating Tika to 2.0.0
>>
>> Am 07.07.2015 um 21:39 schrieb Allison, Timothy B.:
>>> Thank you, Andreas.  I opened PDFBox-2856.
>>>
>>> How about tiffs not being handled by ExtractImages...is this expected?
>>>>       I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>> What tiff? When displaying it with Adobe Reader, I see a word file and a
>> joboptions file.
>>
>> Tilman
>>
>>> Thank you, again.
>>>
>>> Best,
>>>
>>>              Tim
>>> -----Original Message-----
>>> From: Andreas Lehmkuehler [mailto:andreas@lehmi.de]
>>> Sent: Tuesday, July 07, 2015 3:08 PM
>>> To: dev@pdfbox.apache.org
>>> Subject: Re: migrating Tika to 2.0.0
>>>
>>> Hi,
>>>
>>> Am 07.07.2015 um 18:59 schrieb Allison, Timothy B.:
>>>> All,
>>>>
>>>>       As part of TIKA-1285, I updated Jeremy Anderson's original patch for our wrapper for PDFBox 2.0.0 on Tika.  I'm having some problems running the unit tests because at least one of our files [0] is causing hefty resource utilization, which sends my laptop into paging.  The parse does eventually stop, and content is extracted.
>>> What version of PDFBox are you using? I guess the latest SNAPSHOT?
>>>
>>>>       I also tried this file outside of Tika and used the straight PDFBox-app ( both ExtractImages and ExtractText), and performance is also far, far slower when compared with 1.8.9.
>>> I ran some quick tests and I can confirm that 2.0.0 is 4-5 times slower than
>>> 1.8.9 when extracting the text from the given pdf.
>>>
>>>>       Many apologies if this issue has already been identified.
>>> AFAIK, it was unknown until now. Please create a JIRA ticket and thanks for
>>> reporting.
>>>
>>>>       I also noticed that the tiff file is no longer extracted (2.0.0 logger says tiff not handled, but a tiff is extracted with 1.8.9).  Is this expected?
>>>>
>>>>              Thank you!
>>>>
>>>>                   Best,
>>>>
>>>>                          Tim
>>>> [0] https://issues.apache.org/jira/secure/attachment/12743988/testPDF_childAttachments.pdf
>>>>
>>>>
>>>>
>>> BR
>>> Andreas


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org