Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/05/17 08:07:42 UTC

[jira] Updated: (PDFBOX-37) Text Extraction Weirdness

     [ https://issues.apache.org/jira/browse/PDFBOX-37?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-37:
-------------------------------------

    Attachment: cbn95_11.pdf
                flash-crowds02.pdf
                Ftcs99_Egida.pdf

> Text Extraction Weirdness
> -------------------------
>
>                 Key: PDFBOX-37
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-37
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>         Attachments: cbn95_11.pdf, flash-crowds02.pdf, Ftcs99_Egida.pdf
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1117515
> Originally submitted by benlitchfield on 2005-02-06 13:20.
> I've now found a few other possible bugs as well (especially look at
> #3, it exhibits several of the problems):
> 1. Extra \ characters showing up next to [] in the references section
> of:
> http://www.cs.utexas.edu/users/lam/Vita/ACM/WooLam93c.pdf
> 2. [ and ] showing up as #5B and #5D (their hex values), as well as the
> same strange spacing problem I mentioned previously:
> http://www.cs.utexas.edu/users/lasr/pub/pdf/Ftcs99_Egida.pdf
> 3. Many, many missing characters, and the #NN codes from above:
> http://stat.tamu.edu/ftp/pub/rjcarroll/nonparametric.calibration/npcal15.pdf
> 4. Similar problem to #3:
> http://www.lns.cornell.edu/public/CBN/1995/CBN95-11/cbn95_11.pdf
> 5. Truly bizarre output from these two:
> http://www.cs.wm.edu/~hnw/courses/cs780/papers/flash-crowds02.pdf
> http://www.pdcl.eng.wayne.edu/msp01/paper10.pdf
> [comment on SourceForge]
> Originally sent by thangaraj_m.
> Logged In: YES 
> user_id=1248110
> PDF conversion into a text file:
> I agree. When I tried to parse a PDF document, I noticed that many
> characters were missing from the final text document. Is there any
> workaround for the missing text issue?
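
For readers unfamiliar with the #NN codes mentioned in the report: #5B and
#5D are the two-digit hexadecimal ASCII values of [ and ]. The Java sketch
below only illustrates that mapping. It is not PDFBox code and not a fix for
this issue, and the class and method names (HexCodeDemo, decodeHexCodes) are
hypothetical.

```java
class HexCodeDemo {
    // Hypothetical helper (not part of PDFBox): replaces every "#NN" sequence,
    // where NN is two hexadecimal digits, with the character it encodes.
    static String decodeHexCodes(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (c == '#' && i + 2 < text.length()) {
                int hi = Character.digit(text.charAt(i + 1), 16);
                int lo = Character.digit(text.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    // Both characters are hex digits: emit the decoded character.
                    out.append((char) (hi * 16 + lo));
                    i += 3;
                    continue;
                }
            }
            // Not a complete #NN sequence: copy the character unchanged.
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "[12] Egida"
        System.out.println(decodeHexCodes("#5B12#5D Egida"));
    }
}
```

A "#" that is not followed by two hex digits is left as-is, so ordinary text
containing "#" is not altered.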
