Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 01:06:35 UTC
[jira] [Closed] (PDFBOX-37) Text Extraction Weirdness

     [ https://issues.apache.org/jira/browse/PDFBOX-37?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Hewson closed PDFBOX-37.
-----------------------------
       Resolution: Fixed
    Fix Version/s: 2.0.0

I tested these files in 2.0 and they're essentially fixed:

{{cbn95_11.pdf}} - OK
{{flash-crowds02}} - OK. The PDF itself embeds bad characters; even Acrobat cannot extract them.
{{Ftcs99_Egida}} - OK
{{npcal15}} - OK
{{WooLam93c}} - Bad detection of spaces in the title, introduced by PDFTextStripper. I've opened a new issue for this: PDFBOX-2425.

> Text Extraction Weirdness
> -------------------------
>
>                 Key: PDFBOX-37
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-37
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>             Fix For: 2.0.0
>
>         Attachments: Ftcs99_Egida.pdf, WooLam93c.pdf, cbn95_11.pdf, flash-crowds02.pdf, npcal15.pdf
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1117515
> Originally submitted by benlitchfield on 2005-02-06 13:20.
> I've now found a few other possible bugs as well (especially look at #3,
> it exhibits several of the problems):
> 1. Extra \ characters showing up next to [] in the references section of:
> http://www.cs.utexas.edu/users/lam/Vita/ACM/WooLam93c.pdf
> 2. [ and ] showing up as #5B and #5D (their hex values), as well as the
> same strange spacing problem I mentioned previously:
> http://www.cs.utexas.edu/users/lasr/pub/pdf/Ftcs99_Egida.pdf
> 3. Many, many missing characters, and the #NN codes from above:
> http://stat.tamu.edu/ftp/pub/rjcarroll/nonparametric.calibration/npcal15.pdf
> 4. Similar problem to #3:
> http://www.lns.cornell.edu/public/CBN/1995/CBN95-11/cbn95_11.pdf
> 5. Truly bizarre output from these two:
> http://www.cs.wm.edu/~hnw/courses/cs780/papers/flash-crowds02.pdf
> http://www.pdcl.eng.wayne.edu/msp01/paper10.pdf
> [comment on SourceForge]
> Originally sent by thangaraj_m.
> PDF conversion into a text file:
> I agree. When I tried to parse a PDF document, I noticed that many
> characters were missing from the final text document. Is there any
> workaround for the missing text?
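
For illustration only: the #5B and #5D codes in the quoted report are a '#'
followed by the two-digit hexadecimal code of the character (0x5B is '[' and
0x5D is ']'). The sketch below shows that mapping as a plain post-processing
step on already-extracted text. It is not part of PDFBox; the class name
HexEscapeDecoder and its decode method are hypothetical, and it assumes every
"#" followed by two hex digits in the text is such an escape. Since the issue
is closed as fixed in 2.0.0, a step like this should not be needed there.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class.
public class HexEscapeDecoder {
    // Matches '#' followed by exactly two hex digits, e.g. "#5B".
    private static final Pattern HEX_ESCAPE = Pattern.compile("#([0-9A-Fa-f]{2})");

    /** Replaces each #NN escape with the character whose code is hex NN. */
    public static String decode(String text) {
        Matcher m = HEX_ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '\' and '$' from being treated specially.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: [12] Woo and Lam
        System.out.println(decode("#5B12#5D Woo and Lam"));
    }
}
```

Note that this would also rewrite any literal text that happens to look like
an escape (for example a hex colour fragment such as "#5Bc0de"), so it only
suits output known to contain the stray codes.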



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)