Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2009/04/01 16:35:13 UTC

[jira] Resolved: (PDFBOX-349) Spaces between words ignored in scanned pdf files

     [ https://issues.apache.org/jira/browse/PDFBOX-349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-349.
----------------------------------

    Resolution: Fixed

Fix checked into trunk. 

Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFStreamEngine.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
Sending        trunk/test/input/10101-AR.pdf-sorted.txt
Sending        trunk/test/input/10101-AR.pdf.txt
Sending        trunk/test/input/601501018.pdf-sorted.txt
Sending        trunk/test/input/Exolab.pdf-sorted.txt
Sending        trunk/test/input/Exolab.pdf.txt
Sending        trunk/test/input/Garcia2003b__Correlative_exploration_of_EEG_Signals.pdf-sorted.txt
Sending        trunk/test/input/Garcia2003b__Correlative_exploration_of_EEG_Signals.pdf.txt
Sending        trunk/test/input/Garcia2004_thesis.pdf-sorted.txt
Sending        trunk/test/input/Garcia2004_thesis.pdf.txt
Sending        trunk/test/input/JavaMail-1.2.pdf-sorted.txt
Sending        trunk/test/input/JavaMail-1.2.pdf.txt
Sending        trunk/test/input/Michel2001__Review_p2_structured.pdf-sorted.txt
Sending        trunk/test/input/Michel2001__Review_p2_structured.pdf.txt
Sending        trunk/test/input/OSP_framework.pdf-sorted.txt
Sending        trunk/test/input/OSP_framework.pdf.txt
Sending        trunk/test/input/SphericalHomeomorphism.pdf-sorted.txt
Sending        trunk/test/input/SphericalHomeomorphism.pdf.txt
Sending        trunk/test/input/T05140.pdf-sorted.txt
Sending        trunk/test/input/T05140.pdf.txt
Sending        trunk/test/input/amyuni2_05d__pdf1_3_acro4x.pdf-sorted.txt
Sending        trunk/test/input/amyuni2_05d__pdf1_3_acro4x.pdf.txt
Sending        trunk/test/input/authentication.pdf-sorted.txt
Sending        trunk/test/input/authentication.pdf.txt
Sending        trunk/test/input/c21-5916 .pdf-sorted.txt
Sending        trunk/test/input/c21-5916 .pdf.txt
Sending        trunk/test/input/cweb.pdf-sorted.txt
Sending        trunk/test/input/cweb.pdf.txt
Sending        trunk/test/input/defensive_driving_class_schedule.pdf-sorted.txt
Sending        trunk/test/input/defensive_driving_class_schedule.pdf.txt
Sending        trunk/test/input/hexnumberproblem.pdf-sorted.txt
Sending        trunk/test/input/hexnumberproblem.pdf.txt
Sending        trunk/test/input/null_thread_bead.pdf-sorted.txt
Sending        trunk/test/input/null_thread_bead.pdf.txt
Sending        trunk/test/input/ocalc.pdf-sorted.txt
Sending        trunk/test/input/ocalc.pdf.txt
Sending        trunk/test/input/pdf_with_lots_of_fields.pdf-sorted.txt
Sending        trunk/test/input/pdf_with_lots_of_fields.pdf.txt
Sending        trunk/test/input/rc5.pdf-sorted.txt
Sending        trunk/test/input/rc5.pdf.txt
Sending        trunk/test/input/ruminations.pdf-sorted.txt
Sending        trunk/test/input/ruminations.pdf.txt
Sending        trunk/test/input/sample_fonts_solidconvertor.pdf-sorted.txt
Sending        trunk/test/input/sample_fonts_solidconvertor.pdf.txt
Sending        trunk/test/input/sha256.pdf-sorted.txt
Sending        trunk/test/input/sha256.pdf.txt
Sending        trunk/test/input/surface_interpolation.pdf-sorted.txt
Sending        trunk/test/input/surface_interpolation.pdf.txt
Sending        trunk/test/input/tech_report.pdf-sorted.txt
Sending        trunk/test/input/tech_report.pdf.txt
Sending        trunk/test/input/warp.pdf-sorted.txt
Sending        trunk/test/input/warp.pdf.txt
Transmitting file data .....................................................
Committed revision 760902.

> Spaces between words ignored in scanned pdf files
> -------------------------------------------------
>
>                 Key: PDFBOX-349
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-349
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Jukka Zitting
>         Attachments: SpacingFix.zip, UpdatedSpacingRegressionFiles.zip
>
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1922502&group_id=78314&atid=552832
> I am using PDF-Box-0.7.3.dll with C# and have tested extraction on two
> searchable PDFs that I scanned in from paper. Spaces between words are
> ignored for both files. I have also tested another PDF file (which I
> downloaded from the internet), and it was parsed correctly. Unfortunately,
> the file is 1.2 MB and the upload was blocked. Please send me an email
> (gkobzeff@hotmail.com) and I will reply with the file.
> Thanks for looking into this.
> Greg
> [Comment on SourceForge]
> Date: 2008-03-23 21:24
> Sender: gkobzeff
> Logged In: YES 
> user_id=2042611
> Originator: YES
> I have rescanned the file at a smaller file size and attached it.
> Thanks
> File Added: Advanced Pain Mgmt BW.pdf
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=271548&aid=1922502

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.