Posted to dev@pdfbox.apache.org by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org> on 2012/06/29 16:56:42 UTC

[jira] [Created] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)

Daniel Bonniot de Ruisselet created PDFBOX-1351:
---------------------------------------------------

             Summary: False paragraph caused by superscript (1.7 regression)
                 Key: PDFBOX-1351
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1351
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 1.7.0
            Reporter: Daniel Bonniot de Ruisselet


On the attached minimal example document, text extraction seems to be confused by the superscript, and generates three paragraphs where there is only one.

Note that 1.6 handles this case correctly:

{noformat}
$ java -jar /dev/shm/pdfbox-app-1.6.0.jar ExtractText /tmp/superscript.pdf
Jun 29, 2012 4:52:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
$ cat /tmp/superscript.txt 
  
Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3-
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
 
 
$ java -jar /dev/shm/pdfbox-app-1.7.0.jar ExtractText /tmp/superscript.pdf 
Jun 29, 2012 4:52:39 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
$ cat /tmp/superscript.txt                                                 
  
Multiple synthetic routes have been described by R. Filler et al.
11
 regarding 1,3-
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
 
 
{noformat}
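
For readers wondering how a superscript could split a line at all, here is a minimal sketch of one plausible mechanism. It is purely an illustration: the class `LineGroupingSketch`, its `Run` type, and the coordinates are all hypothetical, and nothing here is taken from PDFBox's actual line-detection code. It compares a grouper that breaks the line on any baseline change with one that keeps a run on the line as long as it overlaps the line vertically.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: a hypothetical line grouper, not PDFBox's implementation. */
final class LineGroupingSketch {

    /** A text run with its baseline and glyph height in page units (y grows upward). */
    static final class Run {
        final String text;
        final double baseline;
        final double height;

        Run(String text, double baseline, double height) {
            this.text = text;
            this.baseline = baseline;
            this.height = height;
        }
    }

    /** Naive rule: any change of baseline starts a new line. */
    static List<String> groupByExactBaseline(List<Run> runs) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double lineBaseline = 0;
        for (Run run : runs) {
            if (current.length() > 0 && run.baseline != lineBaseline) {
                lines.add(current.toString());
                current.setLength(0);
            }
            current.append(run.text);
            lineBaseline = run.baseline;
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    /** Tolerant rule: a run stays on the line if it overlaps the line vertically. */
    static List<String> groupByVerticalOverlap(List<Run> runs) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        double lineBottom = 0;
        double lineTop = 0;
        for (Run run : runs) {
            boolean overlaps = run.baseline < lineTop
                    && run.baseline + run.height > lineBottom;
            if (current.length() > 0 && !overlaps) {
                lines.add(current.toString());
                current.setLength(0);
            }
            if (current.length() == 0) {
                // The first run of a line defines the line's vertical extent.
                lineBottom = run.baseline;
                lineTop = run.baseline + run.height;
            }
            current.append(run.text);
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }

    /** Made-up coordinates mimicking the sample text; the "11" is raised and smaller. */
    static List<Run> sampleRuns() {
        return Arrays.asList(
                new Run("described by R. Filler et al.", 700, 12),
                new Run("11", 705, 7),
                new Run(" regarding 1,3-", 700, 12),
                new Run("Bis(perfluorophenyl)propane-1,3-dione.", 686, 12));
    }

    public static void main(String[] args) {
        System.out.println(groupByExactBaseline(sampleRuns()));
        System.out.println(groupByVerticalOverlap(sampleRuns()));
    }
}
```

With these invented coordinates, the exact-baseline rule yields four lines (the "11" and the text after it each start a new one), while the overlap rule yields two, which is the line structure 1.6 produces. Whether 1.7 fails for this reason is not established here.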


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Comment Edited] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)

Posted by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403960#comment-13403960 ] 

Daniel Bonniot de Ruisselet edited comment on PDFBOX-1351 at 6/29/12 3:09 PM:
------------------------------------------------------------------------------

Also please note that this is not just a newline issue but a false paragraph issue, which I think is more disturbing:

$ java -cp /dev/shm/pdfbox-app-1.6.0.jar:classes tmp.PDFParaTest /tmp/superscript.pdf 
Jun 29, 2012 5:02:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
<p>
 
<p>
 
Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3-
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
</p>
<p>
 
 
</p>

$ java -cp /dev/shm/pdfbox-app-1.7.0.jar:classes tmp.PDFParaTest /tmp/superscript.pdf 
Jun 29, 2012 5:02:28 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
<p>
 
<p>
 
Multiple synthetic routes have been described by R. Filler et al.
</p>
<p>
11
 regarding 1,3-
</p>
<p>
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
</p>
<p>
 
</p>
<p>
 
</p>
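
The difference between the two outputs above can be checked mechanically. The following is a minimal sketch, assuming the extracted text is delimited by literal `<p>` and `</p>` markers as in the transcripts; the class name `ParagraphCounter` is hypothetical and is not part of PDFBox or of the attached test. Because the output contains an unclosed `<p>`, the sketch treats either marker as a segment boundary instead of trying to match tags.

```java
import java.util.regex.Pattern;

/** Counts non-blank paragraphs in text delimited by <p> and </p> markers. */
final class ParagraphCounter {

    // Either marker ends the current segment. The sample output contains an
    // unclosed <p>, so strict open/close matching is deliberately not attempted.
    private static final Pattern MARKER = Pattern.compile("</?p>");

    static int countNonBlank(String markedUpText) {
        int count = 0;
        for (String segment : MARKER.split(markedUpText)) {
            if (!segment.trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Output shapes copied from the 1.6.0 and 1.7.0 runs shown above.
        String from16 = "<p>\n \n<p>\n \n"
                + "Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3-\n"
                + "Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and \n"
                + "</p>\n<p>\n \n \n</p>\n";
        String from17 = "<p>\n \n<p>\n \n"
                + "Multiple synthetic routes have been described by R. Filler et al.\n"
                + "</p>\n<p>\n11\n regarding 1,3-\n</p>\n<p>\n"
                + "Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and \n"
                + "</p>\n<p>\n \n</p>\n<p>\n \n</p>\n";
        System.out.println("1.6.0 non-blank paragraphs: " + countNonBlank(from16));
        System.out.println("1.7.0 non-blank paragraphs: " + countNonBlank(from17));
    }
}
```

On the two samples this reports one non-blank paragraph for 1.6.0 and three for 1.7.0, matching the "three paragraphs where there is only one" in the issue description. A check like this could serve as a regression test once the expected count for a document is known.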

                

                  

        

[jira] [Commented] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)

Posted by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13403960#comment-13403960 ] 

Daniel Bonniot de Ruisselet commented on PDFBOX-1351:
-----------------------------------------------------

Also note that this is not just a newline issue, but a false paragraph issue:

$ java -cp /dev/shm/pdfbox-app-1.6.0.jar:classes tmp.PDFParaTest /tmp/superscript.pdf 
Jun 29, 2012 5:02:24 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
<p>
 
<p>
 
Multiple synthetic routes have been described by R. Filler et al.11 regarding 1,3-
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
</p>
<p>
 
 
</p>

$ java -cp /dev/shm/pdfbox-app-1.7.0.jar:classes tmp.PDFParaTest /tmp/superscript.pdf 
Jun 29, 2012 5:02:28 PM org.apache.pdfbox.pdfparser.PDFParser parseObject
WARNING: expected='%%EOF' actual='5 0 obj '
<p>
 
<p>
 
Multiple synthetic routes have been described by R. Filler et al.
</p>
<p>
11
 regarding 1,3-
</p>
<p>
Bis(perfluorophenyl)propane-1,3-dione.  The synthesis and 
</p>
<p>
 
</p>
<p>
 
</p>

                

        

[jira] [Updated] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)

Posted by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated PDFBOX-1351:
------------------------------------------------

    Attachment: PDFParaTest.java
    

        

[jira] [Updated] (PDFBOX-1351) False paragraph caused by superscript (1.7 regression)

Posted by "Daniel Bonniot de Ruisselet (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Bonniot de Ruisselet updated PDFBOX-1351:
------------------------------------------------

    Attachment: superscript.pdf

This file was generated with PDFedit by deleting most of the content from the original document, both to create a minimal test case and to remove potentially confidential information. The generated file triggers a warning but displays fine in Acrobat Reader. The bug behaves the same on the original and on this simplified version. I could not recreate this case from scratch, but maybe someone else will know how.
                