Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/12/21 14:51:27 UTC

Garbage in PDF files

Running the attached PDF file through Tika, I get a lot of garbage in the output (see txt file).  Far more than can be explained by the unmapped characters.  Where is this coming from?

If I take the PDF and flatten it by 'printing' to a PDF file, the garbage goes away.

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

Re: Garbage in PDF files

Posted by Joe Wicentowski <jo...@gmail.com>.

Peter,

Reading Mr. McFly's very honest answers in this SF86, I've not gotten such
a laugh out of a mailing list post in quite some time.

Joe

Re: Garbage in PDF files

Posted by Tilman Hausherr <TH...@t-online.de>.

On 21.12.2021 at 17:16, Peter Kronenberg wrote:
>
> So are all the ‘garbage’ characters I’m seeing simply due to unmapped 
> characters?
>
Yes


> Is there any solution to make it look prettier?
>
No


> Or is ‘flattening’ (if that’s the correct word), the best solution?
>
Yes, or OCR, or using a dictionary to discard the trash. IIRC Tika has an
option to run OCR and compare whether the result is better.
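
As an illustration of the dictionary idea, here is a minimal sketch in plain Java. It is not part of Tika or PDFBox; the class name DictionaryFilter, the method keepKnownWords, and the tiny word list are all hypothetical and only show the shape of the approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not a Tika or PDFBox API.
public class DictionaryFilter {

    // Keeps only whitespace-separated tokens whose lower-cased, letters-only
    // form appears in the dictionary; everything else is treated as trash.
    static String keepKnownWords(String text, Set<String> dictionary) {
        List<String> kept = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String normalized = token.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}]", "");
            if (!normalized.isEmpty() && dictionary.contains(normalized)) {
                kept.add(token);
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        // Assumed toy dictionary; a real one would be far larger.
        Set<String> dictionary = Set.of("name", "of", "applicant");
        String extracted = "Name \u0003\u0011\u0007 of q#x Applicant:";
        System.out.println(keepKnownWords(extracted, dictionary)); // prints "Name of Applicant:"
    }
}
```

A filter like this also drops legitimate text that is missing from the word list (names, numbers, abbreviations), so it trades recall for cleanliness.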

Tilman



RE: Garbage in PDF files

Posted by Peter Kronenberg <pe...@torch.ai>.

So are all the ‘garbage’ characters I’m seeing simply due to unmapped characters?  Is there any solution to make it look prettier?  Or is ‘flattening’ (if that’s the correct word), the best solution?


Re: Garbage in PDF files

Posted by Tilman Hausherr <TH...@t-online.de>.

On 21.12.2021 at 15:51, Peter Kronenberg wrote:
>
> Running the attached PDF file through TIKA, I get a lot of garbage in 
> the output (see txt file).  Far more than can be explained by the 
> unmapped characters.  Where is this coming from?
>
After the character is found to be unmapped, PDFBox tries a backup 
strategy, which is obviously not successful here.

        // when there is no Unicode mapping available, Acrobat simply coerces the character code
        // into Unicode, so we do the same. Subclasses of PDFStreamEngine don't necessarily want
        // this, which is why we leave it until this point in PDFTextStreamEngine.
        if (unicode == null)
        {
            if (font instanceof PDSimpleFont)
            {
                char c = (char) code;
                unicode = new String(new char[] { c });
            }
            else
            {
                // Acrobat doesn't seem to coerce composite font's character codes, instead it
                // skips them. See the "allah2.pdf" TestTextStripper file.
                return;
            }
        }
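
To make the effect of that fallback concrete, here is a small standalone sketch. It does not use PDFBox; the class CoercionDemo and the Map standing in for a font's Unicode mapping are assumptions made for illustration. It assumes a subset font that numbers its glyphs 1, 2, 3 and has a mapping for code 1 only.

```java
import java.util.Map;

// Hypothetical demo, not PDFBox code: a Map stands in for the font's Unicode mapping.
public class CoercionDemo {

    // Mirrors the simple-font branch: when no mapping exists,
    // the raw character code is coerced directly into a char.
    static String decode(Map<Integer, String> toUnicode, int code) {
        String unicode = toUnicode.get(code);
        if (unicode == null) {
            unicode = String.valueOf((char) code);
        }
        return unicode;
    }

    static String decodeAll(Map<Integer, String> toUnicode, int[] codes) {
        StringBuilder sb = new StringBuilder();
        for (int code : codes) {
            sb.append(decode(toUnicode, code));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> partialMapping = Map.of(1, "H"); // only code 1 is mapped
        String text = decodeAll(partialMapping, new int[] { 1, 2, 3 });
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c); // prints U+0048, U+0002, U+0003
        }
    }
}
```

Codes 2 and 3 come out as control characters rather than the letters the glyphs depict, which is the kind of output that shows up as garbage in extracted text.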

Adobe Reader also shows trash.

I can't comment on whether it is "far more than expected"; that would
require counting and making comparisons.
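
One rough way to do such a count, sketched in plain Java under stated assumptions: the class GarbageRatio and the choice of "suspicious" categories (control, private-use, unassigned, and the U+FFFD replacement character) are illustrative guesses, not anything defined by Tika or PDFBox.

```java
// Hypothetical helper for comparing two extractions of the same document.
public class GarbageRatio {

    // Treats non-whitespace control, private-use, and unassigned code points,
    // plus U+FFFD, as suspicious. This category list is an assumption.
    static boolean isSuspicious(int codePoint) {
        if (Character.isWhitespace(codePoint)) {
            return false;
        }
        int type = Character.getType(codePoint);
        return type == Character.CONTROL
                || type == Character.PRIVATE_USE
                || type == Character.UNASSIGNED
                || codePoint == 0xFFFD;
    }

    // Fraction of code points in the text that look like extraction trash.
    static double suspiciousRatio(String text) {
        long total = text.codePoints().count();
        if (total == 0) {
            return 0.0;
        }
        long bad = text.codePoints().filter(GarbageRatio::isSuspicious).count();
        return (double) bad / total;
    }

    public static void main(String[] args) {
        String sample = "ab" + (char) 2 + (char) 3;
        System.out.println(suspiciousRatio(sample)); // prints 0.5
    }
}
```

Running this on the Tika output of the original PDF and on the output of the flattened PDF would give two numbers to compare. It only catches trash that lands in these categories; unmapped codes coerced into ordinary printable characters would not be counted.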


> If I take the PDF and flatten it by ‘printing’ to a PDF file, the 
> garbage goes away
>
Printing is probably converting it all to raster graphics or vector 
graphics.

Tilman