Posted to dev@pdfbox.apache.org by "Justin LeFebvre (JIRA)" <ji...@apache.org> on 2009/03/27 16:14:50 UTC

[jira] Updated: (PDFBOX-349) Spaces between words ignored in scanned pdf files

     [ https://issues.apache.org/jira/browse/PDFBOX-349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin LeFebvre updated PDFBOX-349:
-----------------------------------

    Attachment: SpacingFix.zip

With the attached fix, I have improved spacing detection for files such as this one, where the page was scanned in slightly skewed. Previously, PDFBox relied only on the reported width of the space character to determine whether a space should be added to the text output. PDFBox still uses that width, but it now also keeps a running average of the character widths seen previously. To determine whether a space should be added, it compares the two widths, picks the smaller one, and adds it to the previous X position to get the position where we expect the next word to start. If the expected X position is less than the new X position, then we add a space.
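
For readers without the attachment, the check described above can be sketched roughly as follows. This is only an illustration and not the code in SpacingFix.zip: the class name SpacingSketch and its methods are hypothetical, the running average is assumed to be a plain arithmetic mean, and "previous X position" is assumed to be a coordinate supplied by the caller (for example, where the previous character ended).

```java
// Hypothetical sketch of the spacing check; not the code from SpacingFix.zip.
final class SpacingSketch {
    private double widthSum = 0.0; // sum of all character widths seen so far
    private int widthCount = 0;    // number of character widths seen so far

    /** Records one character width so it contributes to the running average. */
    void recordCharWidth(double width) {
        widthSum += width;
        widthCount++;
    }

    /**
     * Decides whether a space should be emitted before the character at newX.
     * Assumption: previousX, newX and the widths share the same units.
     */
    boolean shouldInsertSpace(double previousX, double newX, double reportedSpaceWidth) {
        double threshold = reportedSpaceWidth;
        if (widthCount > 0) {
            // Compare the two widths and pick the smaller one.
            threshold = Math.min(reportedSpaceWidth, widthSum / widthCount);
        }
        // Where we expect the next word to start.
        double expectedX = previousX + threshold;
        return expectedX < newX;
    }

    public static void main(String[] args) {
        SpacingSketch sketch = new SpacingSketch();
        sketch.recordCharWidth(5.0);
        sketch.recordCharWidth(5.0);
        sketch.recordCharWidth(5.0);
        // The reported space width (12.0) is larger than the average (5.0),
        // so the average becomes the threshold: 100 + 5 = 105.
        System.out.println(sketch.shouldInsertSpace(100.0, 107.0, 12.0)); // true
        System.out.println(sketch.shouldInsertSpace(100.0, 103.0, 12.0)); // false
    }
}
```

In this made-up example, relying on the reported space width alone would expect the next word at 112 and miss the gap at 107, while the smaller running average catches it.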

> Spaces between words ignored in scanned pdf files
> -------------------------------------------------
>
>                 Key: PDFBOX-349
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-349
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>            Reporter: Jukka Zitting
>         Attachments: SpacingFix.zip
>
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1922502&group_id=78314&atid=552832
> I am using PDF-Box-0.7.3.dll with C# and have tested extraction on two
> searchable pdfs that I have scanned in from paper. Spaces between words are
> ignored for both files. I have also tested another pdf file (which I
> downloaded from the internet) and it was parsed correctly. Unfortunately,
> the file is 1.2MB and the upload was blocked. Please send me an email
> (gkobzeff@hotmail.com) and I will reply back with the file.
> Thanks for looking into this.
> Greg
> [Comment on SourceForge]
> Date: 2008-03-23 21:24
> Sender: gkobzeff
> Logged In: YES 
> user_id=2042611
> Originator: YES
> I have scanned the file into a smaller file size. I have attached the
> file.
> Thanks
> File Added: Advanced Pain Mgmt BW.pdf
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=271548&aid=1922502

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.