Posted to dev@pdfbox.apache.org by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org> on 2012/06/29 17:09:42 UTC

[jira] [Comment Edited] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)

    [ https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403960#comment-13403960 ] 

Daniel Bonniot de Ruisselet edited comment on PDFBOX-1351 at 6/29/12 3:09 PM:
------------------------------------------------------------------------------

Also, please note that this is not just a newline issue but a false paragraph issue, which I think is more disturbing:

$ java -cp /dev/shm/pdfbox-app-1.6.0.jar:classes tmp.PDFParaTest /tmp/superscript.pdf 
Jun 29, 2012 5:02:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
<p>
 
<p>
 
Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3-
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
</p>
<p>
 
 
</p>

$ java -cp /dev/shm/pdfbox-app-1.7.0.jar:classes tmp.PDFParaTest /tmp/superscript.pdf 
Jun 29, 2012 5:02:28 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
<p>
 
<p>
 
Multiple synthetic routes have been described by R. Filler et al.
</p>
<p>
11
 regarding 1,3-
</p>
<p>
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
</p>
<p>
 
</p>
<p>
 
</p>
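
The difference between the two runs above can be turned into a small regression check. The following is only a sketch, not part of the attached PDFParaTest.java: the class name ParagraphCount, the method nonBlankParagraphs, and the two string constants are hypothetical, and the constants are simply the marked-up text printed above (without the log lines). It pairs each </p> with the nearest preceding <p> and counts the blocks that contain something other than whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from PDFBox or from PDFParaTest.java.
class ParagraphCount {

    // Marked-up text as printed by the 1.6.0 run above.
    static final String OUTPUT_1_6 =
        "<p>\n \n<p>\n \n"
        + "Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3-\n"
        + "Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and \n"
        + "</p>\n<p>\n \n \n</p>\n";

    // Marked-up text as printed by the 1.7.0 run above.
    static final String OUTPUT_1_7 =
        "<p>\n \n<p>\n \n"
        + "Multiple synthetic routes have been described by R. Filler et al.\n"
        + "</p>\n<p>\n11\n regarding 1,3-\n</p>\n<p>\n"
        + "Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and \n"
        + "</p>\n<p>\n \n</p>\n<p>\n \n</p>\n";

    /** Returns the trimmed content of every <p>...</p> block that is not blank. */
    static List<String> nonBlankParagraphs(String extracted) {
        List<String> paragraphs = new ArrayList<>();
        int from = 0;
        while (true) {
            int end = extracted.indexOf("</p>", from);
            if (end < 0) {
                break;
            }
            // Nearest opening marker before this close; tolerates the
            // unclosed leading <p> seen in both outputs.
            int start = extracted.lastIndexOf("<p>", end);
            int bodyStart = (start >= from) ? start + "<p>".length() : from;
            String body = extracted.substring(bodyStart, end).trim();
            if (!body.isEmpty()) {
                paragraphs.add(body);
            }
            from = end + "</p>".length();
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        System.out.println("1.6.0: " + nonBlankParagraphs(OUTPUT_1_6).size()); // prints "1.6.0: 1"
        System.out.println("1.7.0: " + nonBlankParagraphs(OUTPUT_1_7).size()); // prints "1.7.0: 3"
    }
}
```

A count of 1 for the 1.6.0 output and 3 for the 1.7.0 output matches the "three paragraphs where there is only one" description in the issue.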

                
      was (Author: dbr):
    Also not this is not just a newline issue, but a false paragraph issue:

$ java -cp /dev/shm/pdfbox-app-1.6.0.jar:classes tmp.PDFParaTest /tmp/superscript.pdf 
Jun 29, 2012 5:02:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
<p>
 
<p>
 
Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3-
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
</p>
<p>
 
 
</p>

$ java -cp /dev/shm/pdfbox-app-1.7.0.jar:classes tmp.PDFParaTest /tmp/superscript.pdf 
Jun 29, 2012 5:02:28 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
<p>
 
<p>
 
Multiple synthetic routes have been described by R. Filler et al.
</p>
<p>
11
 regarding 1,3-
</p>
<p>
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
</p>
<p>
 
</p>
<p>
 
</p>

                  
> False paragraph caused by superscript (1.7 regression)
> ------------------------------------------------------
>
>                 Key: PDFBOX-1351
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1351
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.7.0
>            Reporter: Daniel Bonniot de Ruisselet
>         Attachments: PDFParaTest.java, superscript.pdf
>
>
> On the attached minimal example document, text extraction seems to be confused by the superscript, and generates three paragraphs where there is only one.
> Note that 1.6 handles this case correctly:
> {noformat}
> $ java -jar /dev/shm/pdfbox-app-1.6.0.jar ExtractText /tmp/superscript.pdf
> Jun 29, 2012 4:52:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
> WARNING: expected='%%EOF' actual='5 0 obj '
> $ cat /tmp/superscript.txt 
>   
> Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3-
> Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
>  
>  
> $ java -jar /dev/shm/pdfbox-app-1.7.0.jar ExtractText /tmp/superscript.pdf 
> Jun 29, 2012 4:52:39 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
> WARNING: expected='%%EOF' actual='5 0 obj '
> $ cat /tmp/superscript.txt                                                 
>   
> Multiple synthetic routes have been described by R. Filler et al.
> 11
>  regarding 1,3-
> Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
>  
>  
> {noformat}
