Posted to dev@pdfbox.apache.org by "Bernard (JIRA)" <ji...@apache.org> on 2010/06/01 14:27:36 UTC

[jira] Issue Comment Edited: (PDFBOX-586) Text Extraction Regression ?

    [ https://issues.apache.org/jira/browse/PDFBOX-586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12873972#action_12873972 ] 

Bernard edited comment on PDFBOX-586 at 6/1/10 8:26 AM:
--------------------------------------------------------

Hi,

I have just tried the .jar on 3 'bad' PDFs and it works fine.  I wonder why the sources (from the .zip) didn't...
As I don't have the IBM (encryption) lib (could you add a link to it on the download page?), I had to comment all of that out.
But I don't care about encrypted PDFs for now.

I have also commented out the Bi-Di text handling: no need for it yet, and I don't have the lib source.

As I run on Android, I had to comment out 20% of the Font/AWT-related stuff.  I need the characters; I don't care about viewing the PDF page.

After all those changes my PDFs were successfully opened by PDFBox 0.7.3, but some PDFs didn't work with PDFBox 1.1.0.


I will continue investigating...  (and not working on my app :-( ):    http://bsegonnes.free.fr/multireader/en_multireader.html

> Text Extraction Regression ?
> ----------------------------
>
>                 Key: PDFBOX-586
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-586
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.1.0
>         Environment: Windows XP + Eclipse + PDFBox sources
>            Reporter: Bernard
>         Attachments: ASEB-Camping_Car_ou_Bateau.pdf, Eval.pdf, internals.pdf, PDFBOX586-ASEB-Camping_Car_ou_Bateau.txt, PDFBOX586-Eval.txt, PDFBOX586-internals.txt
>
>
> Hi,
> I have noticed that I can extract text from some PDF files with PDFBox 0.7.4, but for the same file and the same page, PDFBox 1.1.0 doesn't retrieve any text, or the extraction is worse.
> Am I the only one who thinks there is a regression in text extraction?
> My code looks like this:
>     PDDocument document = PDDocument.load("/sdcard/internals.pdf");
>     int numberOfPages = document.getNumberOfPages();
>     resources = this.getResources();
>
>     android.util.Log.d(TEST_PDFBOX, "readerPDF() resources : " + resources);  // Android code here to get the file
>     resourceGlyphList = R.raw.glyphlist;
>     InputStream rawResource = resources.openRawResource(R.raw.pdftextstripper);  // PDFBox property file
>     android.util.Log.d(TEST_PDFBOX, "readerPDF() rawResource : " + rawResource);
>     Properties properties = new Properties();
>     properties.load(rawResource);
>
>     PDFTextStripper stripper = new PDFTextStripper(properties);
>
>     stripper.setStartPage(pageNumber);  // 1 or any other page
>     stripper.setEndPage(pageNumber);    // same page as above
>     String s = "Page : " + pageNumber + "<br><br>" + stripper.getText(document);
>     android.util.Log.d(TEST_PDFBOX, "readerPDF()  stripper extract pages text : " + s);
> Maybe I should use page.getContents().getStream(), stripper.getTextForRegion("class1"), or stripper.writeText(doc, outputStream) instead?
> I want the text as a String, not as a newly created file...
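
On the last point in the quoted report: a Writer-based call such as stripper.writeText(doc, outputStream) does not have to create a file, because the Writer can be an in-memory java.io.StringWriter. The sketch below uses only the JDK. writeFakePageText is a hypothetical stand-in for the PDFBox call, not PDFBox API, and the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

class WriterToString {

    // Hypothetical stand-in for a Writer-based extraction call
    // (in the report this would be stripper.writeText(doc, writer)).
    static void writeFakePageText(Writer out, int pageNumber) throws IOException {
        out.write("Page : " + pageNumber + "\n");
        out.write("some extracted text");
    }

    // Collects everything written to the Writer in memory and returns it as a String.
    static String captureAsString(int pageNumber) {
        StringWriter buffer = new StringWriter();
        try {
            writeFakePageText(buffer, pageNumber);
        } catch (IOException e) {
            // StringWriter itself does not fail on write; rethrow unchecked just in case.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(captureAsString(1));
    }
}
```

With a real stripper, the same pattern would presumably be: create a StringWriter, pass it as the Writer argument, then call toString() on it. No file is written at any point.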

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.