Posted to users@pdfbox.apache.org by Lachezar Dobrev <l....@gmail.com> on 2017/11/06 15:12:39 UTC

Re: Detecting if PDF contains only/mostly images.

  Well… It worked (mostly) as expected.
  What I did not expect is that a fraction of the scanners in use
turned out to be "smart"-ish: they attempt to perform OCR on the
scanned documents/images. They actually do a somewhat decent job (I
was impressed). The process, however, seems to produce weird PDFs
that contain multiple layers of images stacked on top of each other,
plus text (where it was detected) stacked on top of the graphics. The
text is *transparent* with a *transparent* background (as far as I
understand), so it is invisible but can still be selected, copied and
pasted, which is really nice.
  However, that makes my job that much harder, since bits and pieces
of the image are now in different layers, and there *is* text
content.

  For the time being I am handling these by rendering the page to a
BufferedImage and then using ImageIO manually to write the page as a
JPEG. The process seems to be very inefficient: a 124 KByte PDF file
ends up as a 927 KByte JPEG image (Java Image IO @ 90% quality). I
have asked my colleagues to scan a test page that is suitable for
sharing (limited personal information); I'm open to suggestions on
how to share it.
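
For reference, the ImageIO step described above can be sketched roughly as follows. This is a minimal sketch, assuming the page has already been rendered to an RGB BufferedImage; the names JpegEncoder and toJpeg are made up for illustration, and only standard javax.imageio calls are used.

```java
import javax.imageio.IIOImage;
import javax.imageio.ImageIO;
import javax.imageio.ImageWriteParam;
import javax.imageio.ImageWriter;
import javax.imageio.stream.ImageOutputStream;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical helper, not part of PDFBox.
class JpegEncoder {

    /** Encodes an image as JPEG with an explicit quality between 0.0 and 1.0. */
    static byte[] toJpeg(BufferedImage image, float quality) throws IOException {
        ImageWriter writer = ImageIO.getImageWritersByFormatName("jpeg").next();
        ImageWriteParam param = writer.getDefaultWriteParam();
        // Without MODE_EXPLICIT the quality setting is rejected.
        param.setCompressionMode(ImageWriteParam.MODE_EXPLICIT);
        param.setCompressionQuality(quality);

        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ImageOutputStream out = ImageIO.createImageOutputStream(bytes)) {
            writer.setOutput(out);
            // The image is assumed to have no alpha channel (e.g. TYPE_INT_RGB).
            writer.write(null, new IIOImage(image, null, null), param);
        } finally {
            writer.dispose();
        }
        return bytes.toByteArray();
    }
}
```

Calling JpegEncoder.toJpeg(image, 0.9f) corresponds to the 90% quality setting mentioned above; comparing the returned length for a few quality values is a cheap way to see how much of the size comes from the quality setting rather than from the render resolution.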

  So I'm looking for ways to improve. Is there any way I can:
  * Detect and skip text when it's transparent (PDFTextStripper)
  * Render the page to a BufferedImage, but detect the density from
the images in the page instead of guessing (currently hard-coded to
3*72 = 216 ppi)
  * Detect and possibly use the colour space of the embedded images
(to skip colour for black-grey-white images)
  * (please suggest other items I may have overlooked)
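
On the density item, the arithmetic itself is simple once the size of an embedded image is known both in pixels and in page units (1 pt = 1/72 inch). The sketch below shows only that arithmetic; ImageDensity, ppi and renderDpi are hypothetical names, and the drawn size in points is assumed to come from the transformation matrix in effect when the image is painted, which is not shown here.

```java
// Hypothetical helper: derives a render resolution from an embedded image.
class ImageDensity {

    static final double POINTS_PER_INCH = 72.0;

    /** Pixels per inch along one axis: pixel count divided by drawn size in inches. */
    static double ppi(int pixels, double drawnPoints) {
        if (pixels <= 0 || drawnPoints <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return pixels * POINTS_PER_INCH / drawnPoints;
    }

    /** Picks the larger of the horizontal and vertical densities, rounded. */
    static int renderDpi(int pixelWidth, int pixelHeight,
                         double drawnWidthPoints, double drawnHeightPoints) {
        double dpi = Math.max(ppi(pixelWidth, drawnWidthPoints),
                              ppi(pixelHeight, drawnHeightPoints));
        return (int) Math.round(dpi);
    }
}
```

For instance, an image 1653 pixels wide that is drawn 595 pt wide (an A4 page width) works out to roughly 200 ppi. With several images per page, taking the maximum over all of them would avoid downsampling any layer, at the cost of a larger render.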


2017-10-31 12:23 GMT+02:00 Tilman Hausherr <TH...@t-online.de>:
> Heh heh... It's rather the opposite... it's a java library and the command
> line tools are for convenience :-)
>
> Tilman
>
>
> Am 31.10.2017 um 11:18 schrieb Lachezar Dobrev:
>>
>>    Ahh... You mean use the tool as a *ahm* tool?
>>    I'm so used to seeing these as parts of the command-line tools that
>> I've totally forgotten that their inner elements are suitable for use
>> in code. Thanks.
>>
>>    I think I'm going to create a Writer implementation that throws an
>> exception if non-whitespace is written to it, and use
>> writeText(PDDocument,Writer) to quickly cancel processing when
>> non-whitespace is found.
>>
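
The Writer idea quoted above could look roughly like this. It is a minimal sketch using only java.io; the class and exception names are made up for illustration, and it assumes that the exception thrown by the Writer propagates out of writeText(PDDocument,Writer) unchanged.

```java
import java.io.IOException;
import java.io.Writer;

// Hypothetical Writer that aborts as soon as any non-whitespace arrives.
class NonBlankDetectingWriter extends Writer {

    /** Signals that real text was written, so extraction can stop early. */
    static class TextFoundException extends IOException {
        TextFoundException() {
            super("non-whitespace text found");
        }
    }

    @Override
    public void write(char[] buf, int off, int len) throws IOException {
        for (int i = off; i < off + len; i++) {
            if (!Character.isWhitespace(buf[i])) {
                throw new TextFoundException();
            }
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // Nothing to release.
    }
}
```

The caller would pass an instance to writeText and treat a caught TextFoundException as "this page has text", while a normal return means only whitespace (or nothing) was extracted.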
>> 2017-10-30 19:54 GMT+02:00 Tilman Hausherr <TH...@t-online.de>:
>>>
>>> Am 30.10.2017 um 16:52 schrieb Lachezar Dobrev:
>>>>
>>>>     I have been looking at it. I am actually using (a similar) approach
>>>> to read embedded bar-codes, but there I can test all images.
>>>>     The best I can see in ExtractImages is a way to check if there is
>>>> only one image. However I can not check if there is additional text or
>>>> other content, so that I do not mistakenly skip a page that has a
>>>> single logo (for instance) and lots of other text information.
>>>>     I tried looking at PDFTextStripper, but that is hard to follow.
>>>
>>>
>>> That one is easy... just create the object, set start and end page,
>>> and then call getText().
>>>
>>> Tilman
>>>
>>>
>>>>     Is there any sure(-ish) sign that there is text on a page that I can
>>>> use? Can I check for the existence of something that would tell me
>>>> that there is additional content on the page other than the single
>>>> image?
>>>>
>>>> 2017-10-30 15:53 GMT+02:00 Tilman Hausherr <TH...@t-online.de>:
>>>>>
>>>>> Am 30.10.2017 um 14:04 schrieb Lachezar Dobrev:
>>>>>>
>>>>>>      I have to process PDF files that (supposedly) contain one big
>>>>>> image per page, which is the result of a document scanner. I'd like
>>>>>> to avoid performing PDF-To-Image in these cases, and use the
>>>>>> underlying image instead.
>>>>>>      I am not well-versed in all things PDF and have no idea how to
>>>>>> detect if a page has content other than a single image.
>>>>>>      Please advise.
>>>>>
>>>>>
>>>>> Please have a look at the ExtractImages.java source code. You can
>>>>> change that one to your needs.
>>>>>
>>>>> Tilman
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>>>>> For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Detecting if PDF contains only/mostly images.

Posted by Tilman Hausherr <TH...@t-online.de>.

Am 06.11.2017 um 16:12 schrieb Lachezar Dobrev:
>    Well… It worked (mostly) as expected.
>    What I did not expect is that a fraction of the scanners in use
> turned out to be "smart"-ish: they attempt to perform OCR on the
> scanned documents/images. They actually do a somewhat decent job (I
> was impressed). The process, however, seems to produce weird PDFs
> that contain multiple layers of images stacked on top of each other,
> plus text (where it was detected) stacked on top of the graphics. The
> text is *transparent* with a *transparent* background (as far as I
> understand), so it is invisible but can still be selected, copied and
> pasted, which is really nice.
>    However, that makes my job that much harder, since bits and pieces
> of the image are now in different layers, and there *is* text
> content.
>
>    For the time being I am handling these by rendering the page to a
> BufferedImage and then using ImageIO manually to write the page as a
> JPEG. The process seems to be very inefficient: a 124 KByte PDF file
> ends up as a 927 KByte JPEG image (Java Image IO @ 90%

Save as PNG or (if b/w) as TIF.
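
A minimal sketch of that suggestion with plain ImageIO follows. The class and method names are made up for illustration. PNG is used for both cases here because a TIFF writer is only bundled with ImageIO from Java 9 onwards; whether the result is actually smaller than the JPEG depends on the page content.

```java
import javax.imageio.ImageIO;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;

// Hypothetical helper for lossless output of rendered pages.
class LosslessEncoder {

    /** Redraws the image into a 1-bit black/white buffer. */
    static BufferedImage toBilevel(BufferedImage src) {
        BufferedImage bw = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_BINARY);
        Graphics2D g = bw.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return bw;
    }

    /** Encodes the image as PNG and returns the encoded bytes. */
    static byte[] toPng(BufferedImage image) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        if (!ImageIO.write(image, "png", bytes)) {
            throw new IOException("no PNG writer available");
        }
        return bytes.toByteArray();
    }
}
```

For a black and white scan, LosslessEncoder.toPng(LosslessEncoder.toBilevel(page)) stores one bit per pixel instead of three bytes before compression is even applied.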


> quality). I have asked my colleagues to scan a test page that is
> suitable for sharing (limited personal information); I'm open to
> suggestions on how to share it.
>
>    So I'm looking for ways to improve. Is there any way I can:
>    * Detect and skip text when it's transparent (PDFTextStripper)

Tricky... you'd have to detect whether the font is invisible, whether the
text uses text rendering mode 3 (neither fill nor stroke), or whether its
color matches the background.
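
To illustrate only the rendering mode part of this: in a content stream the mode is set with the Tr operator and saved/restored by q/Q, so a check amounts to tracking that value and looking at it whenever a text-showing operator (Tj, TJ, ' or ") runs. The sketch below does this on an already decoded content stream held in a String. It is a deliberately simplified, hypothetical scanner (VisibleTextScan is a made-up name), it ignores inline images, form XObjects, invisible fonts and colours, and it is not how PDFBox itself parses streams.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical, simplified scanner over a decoded content stream.
class VisibleTextScan {

    private static boolean isRegular(char c) {
        return !Character.isWhitespace(c) && "()<>[]{}/%".indexOf(c) < 0;
    }

    /** Returns the index just past a (possibly nested) literal string. */
    private static int skipLiteral(String s, int open) {
        int depth = 0;
        for (int i = open; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i + 1;
            }
        }
        return s.length();
    }

    /** True if text is shown while the rendering mode is anything but 3. */
    static boolean hasVisibleText(String content) {
        Deque<Integer> savedModes = new ArrayDeque<>();
        int mode = 0; // default rendering mode: fill
        String lastOperand = null;
        int i = 0;
        int n = content.length();
        while (i < n) {
            char c = content.charAt(i);
            if (c == '(') { i = skipLiteral(content, i); continue; }
            if (c == '%') { // comment runs to end of line
                while (i < n && content.charAt(i) != '\n' && content.charAt(i) != '\r') i++;
                continue;
            }
            if (c == '<') {
                if (content.startsWith("<<", i)) { i += 2; continue; }
                int end = content.indexOf('>', i); // hex string
                i = end < 0 ? n : end + 1;
                continue;
            }
            int start = i;
            if (c == '/') i++; // name object
            while (i < n && isRegular(content.charAt(i))) i++;
            if (i == start) { i++; continue; } // whitespace or lone delimiter
            String token = content.substring(start, i);
            switch (token) {
                case "q": savedModes.push(mode); break;
                case "Q": if (!savedModes.isEmpty()) mode = savedModes.pop(); break;
                case "Tr":
                    try {
                        mode = Integer.parseInt(lastOperand);
                    } catch (NumberFormatException e) {
                        // malformed operand: keep the previous mode
                    }
                    break;
                case "Tj": case "TJ": case "'": case "\"":
                    if (mode != 3) return true;
                    break;
                default: lastOperand = token;
            }
        }
        return false;
    }
}
```

In PDFBox 2.x the same value should be available without hand-parsing, through the text state of the current graphics state (getTextState().getRenderingMode()) while a PDFTextStripper subclass processes text; that route also covers form XObjects, which this scanner does not.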

>    * Render the page to a BufferedImage, but detect the density from
> the images in the page without the need to guess (currently guess-set
> to 3*72 = 216 ppi).
>    * Detect and possibly use colour space from the embedded images (to
> skip colour for black-grey-white images)
>    * (please suggest other items I may have overlooked)

Don't know... I think you can't win.
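
For the colour question at least, one fallback that does not depend on the PDF's internal structure is to inspect the rendered pixels and convert to greyscale when no pixel carries real colour. This is a minimal sketch with plain java.awt; GrayDetector, isGray and toGray are made-up names, and the per-pixel loop is written for clarity rather than speed.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper working on an already rendered page image.
class GrayDetector {

    /** True if every pixel's R, G and B differ by at most the tolerance. */
    static boolean isGray(BufferedImage image, int tolerance) {
        for (int y = 0; y < image.getHeight(); y++) {
            for (int x = 0; x < image.getWidth(); x++) {
                int rgb = image.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF;
                int g = (rgb >> 8) & 0xFF;
                int b = rgb & 0xFF;
                int max = Math.max(r, Math.max(g, b));
                int min = Math.min(r, Math.min(g, b));
                if (max - min > tolerance) {
                    return false;
                }
            }
        }
        return true;
    }

    /** Redraws the image into an 8-bit greyscale buffer. */
    static BufferedImage toGray(BufferedImage src) {
        BufferedImage gray = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = gray.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return gray;
    }
}
```

A small non-zero tolerance allows for the slight colour noise that scanners and JPEG compression tend to introduce in otherwise grey pages.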

Tilman

