Posted to dev@pdfbox.apache.org by "Tilman Hausherr (JIRA)" <ji...@apache.org> on 2013/08/06 11:34:47 UTC

[jira] [Created] (PDFBOX-1680) PDFTextStripper returns garbage characters

Tilman Hausherr created PDFBOX-1680:
---------------------------------------

             Summary: PDFTextStripper returns garbage characters
                 Key: PDFBOX-1680
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1680
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.0
         Environment: XP
            Reporter: Tilman Hausherr
         Attachments: steveDoc.pdf

This code:
    PDDocument document = PDDocument.loadNonSeq(new File(pdfFilename), null);
    PDFTextStripper pdfTextStripper = new PDFTextStripper("UTF-8");
    pdfTextStripper.setStartPage(1);
    pdfTextStripper.setEndPage(999);
    System.out.println(pdfTextStripper.getText(document));
returns this text when used with the file mentioned in PDFBOX-1436:
===
Downloads Stack
Welcome to Mac OS X Snow Leopard.
The Dock in Snow Leopard 
includes Stacks, which you 
can use to quickly access 
MYLX\LU[S`\ZLKÄSLZHUK
applications right from  
the Dock. 
Stacks are simple to create. Just drag any folder to  
the right side of the Dock and it becomes a stack.  
Click a stack and it springs from the Dock in either  
HMHUVYHNYPK;VVWLUHÄSLPUHZ[HJRJSPJR[OL 
ÄSLVUJL
Mac OS X Snow Leopard includes three premade 
stacks called Documents, Downloads, and Applications 
@V\VWLULK[OPZÄSLMYVT[OL+V^USVHKZZ[HJR
The Downloads stack captures all of your Internet 
downloads and puts them in one convenient location. 
Files you download in Safari, Mail, and iChat go 
YPNO[PU[V[OL+V^USVHKZZ[HJR>OLUHÄSLÄUPZOLZ
KV^USVHKPUN[OLZ[HJRUV[PÄLZ`V\I`IV\UJPUNHUK
W\[Z[OLUL^ÄSLYPNO[VU[VWZVP[»ZLHZ`[VÄUK
Stacks automatically display their contents in a fan or a 
grid based on the number of items in the stack. You 
can also view the stack as a list. If you prefer one style 
over the other, you can set the stack to always open in 
that style.
:[HJRZPU[LSSPNLU[S`ZOV^[OLTVZ[YLSL]HU[P[LTZÄYZ[
or you can set the sort order so that the items you care 
about most always appear at the top of the stack. To 
customize a stack, position the pointer over the stack 
icon and hold down the mouse button until a menu 
appears. Choose the settings you want from the menu.
;VYLTV]LHÄSLMYVT 
a stack, just open  
the stack and drag the 
item out to where you 
^HU[P[;VKLSL[LHÄSL
move it to the Trash.  
0UMHJ[^OLU`V\»YL
done reading this 
document, feel free  
to throw it out.
Documents Downloads Applications
TM and © 2009 Apple Inc. All rights reserved.
===
The garbage characters are the same ones that were fixed by the change in PDFBOX-490, so it's probably a similar cause.
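
As a side observation on the sample output (not a confirmed diagnosis): the ASCII garbage looks like the expected text with every character code lowered by a constant 25 and the spaces dropped. The sketch below is a hypothetical helper, not part of PDFBox, and the offset is an assumption inferred only from the text shown above. It shifts two of the garbled runs back up by 25. The non-ASCII characters (Ä, ») are not covered.

```java
// Hypothetical diagnostic helper; not part of PDFBox.
class ShiftCheck {
    // Assumption: offset inferred from the sample output in this report.
    static final int OFFSET = 25;

    // Adds OFFSET to every character code of the garbled input.
    static String shift(String garbled) {
        StringBuilder sb = new StringBuilder();
        for (char c : garbled.toCharArray()) {
            sb.append((char) (c + OFFSET));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII prefix of "MYLX\LU[S`\ZLKÄSLZHUK"
        System.out.println(shift("MYLX\\LU[S`\\ZLK")); // prints "frequentlyused"
        // ASCII prefix of "@V\VWLULK[OPZÄSLMYVT..."
        System.out.println(shift("@V\\VWLULK[OPZ"));   // prints "Youopenedthis"
    }
}
```

A constant offset like this would be consistent with character codes being mapped incorrectly for one font, rather than random corruption.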
