Posted to dev@pdfbox.apache.org by "Brian Carrier (JIRA)" <ji...@apache.org> on 2009/02/24 17:34:03 UTC

[jira] Commented: (PDFBOX-61) Spaces in extracted file

    [ https://issues.apache.org/jira/browse/PDFBOX-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676327#action_12676327 ] 

Brian Carrier commented on PDFBOX-61:
-------------------------------------

Note that Adobe Reader also gets this file wrong. The cause is that PDFBox has to guess where some spaces belong, and the guessing works better with some fonts than with others. The trunk currently has a calculation in PDFTextStripper.writePage() that uses a factor of 0.50 to estimate where the next character should start. When I change that value to 0.65, the Tom_3 file comes out fine (0.60 still leaves an extra space). However, several of the regression tests start to fail quite badly with values of 0.60 and above.

There seem to be two options:
1) We make the fraction configurable via an API, so that callers can change it for files they know have non-typical font shapes and sizes (keeping the current 0.5 value as the default).
2) We try to find a better way to estimate where the next character should be.
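
To make option 1 concrete, here is a minimal sketch of a gap-threshold heuristic with the fraction passed in by the caller. It is not the actual PDFTextStripper.writePage() code: the Glyph class, the join() method, and all coordinates are hypothetical and invented for illustration. It only shows how one fraction can be right for one font and wrong for another.

```java
import java.util.Arrays;
import java.util.List;

class SpaceGuessSketch {
    /** Hypothetical glyph record: its text, left edge, and width (same units). */
    static final class Glyph {
        final String text;
        final float x;
        final float width;

        Glyph(String text, float x, float width) {
            this.text = text;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Joins glyphs left to right. A space is emitted when a glyph starts to the
     * right of the expected start of the next word: the end of the previous
     * glyph plus fraction * spaceWidth. The fraction is the tunable value.
     */
    static String join(List<Glyph> glyphs, float spaceWidth, float fraction) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                float expectedStart = prev.x + prev.width + spaceWidth * fraction;
                if (g.x > expectedStart) {
                    out.append(' ');
                }
            }
            out.append(g.text);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        float spaceWidth = 10f; // assumed width of a space, made up for the demo

        // One word whose font leaves a wide gap (6.2) after the "T".
        List<Glyph> kernedWord = Arrays.asList(
                new Glyph("T", 0f, 6f),
                new Glyph("o", 12.2f, 5f),
                new Glyph("m", 17.2f, 8f));

        // Two words separated by a narrow but genuine space (gap 5.8).
        List<Glyph> twoWords = Arrays.asList(
                new Glyph("a", 0f, 5f),
                new Glyph("b", 10.8f, 5f));

        for (float fraction : new float[] {0.50f, 0.60f, 0.65f}) {
            System.out.println(fraction + ": [" + join(kernedWord, spaceWidth, fraction)
                    + "] [" + join(twoWords, spaceWidth, fraction) + "]");
        }
    }
}
```

With these made-up numbers, 0.50 and 0.60 split the single word into "T om" and only 0.65 yields "Tom", while the genuine space in "a b" survives only at 0.50. A per-caller fraction would let someone fix a file like Tom_3 without changing the default that the regression tests rely on.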


> Spaces in extracted file
> ------------------------
>
>                 Key: PDFBOX-61
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-61
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1208824
> Originally submitted by nobody on 2005-05-25 16:40.
> In trying to integrate with lucene, I was having 
> problems.  The Lucene people suggested that I check 
> the output of extract utility against one of my test pdf's.  
> When I did, I saw spaces placed inside many of the 
> words.  I was on version 0.7.0.  So I downloaded 0.7.1 
> and see the same results.
> One of the test files where I see this issue is attached.
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1208824&file_id=135995
> Tom_3.pdf (application/pdf), 10145 bytes
> Test pdf file.
