Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/03/14 22:38:46 UTC

[jira] [Issue Comment Deleted] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

     [ https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson updated PDFBOX-1988:
--------------------------------

    Comment: was deleted

(was: {quote}
adding a line for Courier-Bold in the PDFBox_External_Fonts.properties file.
{quote}

Courier-Bold is one of the Standard 14 fonts, so it is not an external font. It is also a Type 1 font, not a TrueType font.)
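
For reference, the Standard 14 point can be checked with a small lookup. This is only a minimal sketch: the class name Standard14Check and the method isStandard14 are hypothetical helpers made up for illustration, not part of the PDFBox API. The font names are the 14 standard Type 1 font names from the PDF specification. The check is an exact string match, so it assumes the BaseFont name is already in its canonical form (no subset prefix, no alternate spelling such as "CourierNew,Bold").

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not a PDFBox class.
final class Standard14Check {

    // The 14 standard Type 1 font names defined by the PDF specification.
    private static final Set<String> STANDARD_14 =
        Collections.unmodifiableSet(new HashSet<String>(Arrays.asList(
            "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
            "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
            "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
            "Symbol", "ZapfDingbats")));

    private Standard14Check() {
    }

    // Returns true only for an exact match against one of the 14 names.
    static boolean isStandard14(String baseFontName) {
        return baseFontName != null && STANDARD_14.contains(baseFontName);
    }

    public static void main(String[] args) {
        System.out.println("Courier-Bold -> " + isStandard14("Courier-Bold"));
        System.out.println("Arial -> " + isStandard14("Arial"));
    }
}
```

A font whose BaseFont passes this check is expected to be available to any PDF consumer without being embedded, which is why it would not need an entry in an external-font mapping.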

> PDFBox ExtractText issue of PDF with no embedded fonts
> ------------------------------------------------------
>
>                 Key: PDFBOX-1988
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1988
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 1.8.4
>         Environment: Windows 7
> Also, PASE on IBM i
>            Reporter: Craig Strong
>              Labels: patch
>             Fix For: 1.8.5, 2.0.0
>
>         Attachments: Test1.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have been using PDFBox 1.8.4 to extract text from several different PDF files without problems. I use the latest PDFBox app with the ExtractText command line. There is one PDF from which PDFBox (and iText) fails to extract any text, even though I can extract the text with Adobe Reader and with pdftotext.exe, which is part of Xpdf. The command I run is "java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt". I don't want to rely on running pdftotext.exe from a PC, since this is part of an automated application.
> I think the error relates to an unknown font type and to having to use the few fonts bundled in the jar file. I tried using the API classes to force a font from a specific location, but I still got errors. I thought I had loaded the font with the loadTTF method, but I don't know whether that had any effect on the font. In any case, I would really like this to work directly from the ExtractText class.
> Here are the errors I am getting. I tried this on both a Windows 7 PC and our IBM i in the PASE environment, and I get the same errors on both. The section starting at processEncodedText repeats several times, so I have included only the first entries.
>  
> Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont                           
> WARNING: Substituting TrueType for unknown font subtype=                                                  
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator                            
> WARNING: java.lang.NullPointerException                                                                   
> Throwable occurred: java.lang.NullPointerException                                                        
>         at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
>         at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)    
>         at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)    
>         at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)  
>         at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)             
>         at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)        
>         at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)         
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554) 
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)   
>         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)     
>         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)    
>         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)       
>         at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)              
>         at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)                          
>         at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)                                    
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText           
> WARNING: java.lang.NullPointerException                                                     
> Throwable occurred: java.lang.NullPointerException                                            
>         at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>         at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)                 
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)   
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)  
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)  
>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)     
>         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)       
>         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)      
>         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)         
>         at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)                
>         at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)                            
>         at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)                                      
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator                
> WARNING: java.lang.NullPointerException                                                       
> Throwable occurred: java.lang.NullPointerException                                            
>         at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:364)
>         at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)                 
>         at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)   
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)  
>         at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)  
>         at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)     
>         at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)       
>         at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)      
>         at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)         
>         at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)                
>         at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)                            
>         at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)                                      
> Thanks,
> Craig Strong



--
This message was sent by Atlassian JIRA
(v6.2#6252)