Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2011/01/25 19:10:48 UTC

[jira] Commented: (PDFBOX-949) ExtractText returns junk

    [ https://issues.apache.org/jira/browse/PDFBOX-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986565#action_12986565 ] 

Andreas Lehmkühler commented on PDFBOX-949:
-------------------------------------------

I extracted the text using the current trunk version (see attachment). There are some issues with the mathematical formulas and the text within the diagrams, but the body text itself looks quite good.

> ExtractText returns junk
> ------------------------
>
>                 Key: PDFBOX-949
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-949
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.4.0
>         Environment: Ubuntu Linux 10.10, Sun Java 1.6.0_22
>            Reporter: Nikhil Chhaochharia
>            Priority: Minor
>             Fix For: 1.5.0
>
>         Attachments: PDFBOX945-NIPS2010_0566.pdf, PDFBOX945-NIPS2010_0566.txt
>
>
> Extracting text from the PDF file at http://books.nips.cc/papers/files/nips23/NIPS2010_0566.pdf produces the garbled characters shown below. No exceptions are thrown.
> The command used was "java -jar pdfbox-app-1.4.0.jar ExtractText -sort -console NIPS2010_0566.pdf"
> 1 1 1 1
> '—;˜: :'¸s ; s :; s˜ :
> h ` s ˆ ; s ;:s ¸ˆ:
> h ` s , s —
> [ ' : o[':p t
> u ˜
> s s
> u t
> t
> u `
> [': u
> 6
> [ ' : fi
> u — s
> u ' s u
> ˜ [': u ˜
> u
> — s s
> s s
> u ˜ u / s
> - - s s s s s
> u ˆ s
> s s
> t 1 u / s
> s o
