Posted to dev@pdfbox.apache.org by "Enrique Pérez (Created JIRA)" <ji...@apache.org> on 2012/01/23 12:19:41 UTC

[jira] [Created] (PDFBOX-1213) Adding style information to the PDF to HTML converter

Adding style information to the PDF to HTML converter
-----------------------------------------------------

                 Key: PDFBOX-1213
                 URL: https://issues.apache.org/jira/browse/PDFBOX-1213
             Project: PDFBox
          Issue Type: Improvement
    Affects Versions: 1.6.0
            Reporter: Enrique Pérez


This patch modifies the PDF to HTML conversion to add style information (bold, italic, and font size) to the resulting file. We have also removed the "DOCTYPE" header because some parsers throw the following exception:

[Fatal Error] loose.dtd:31:3: The declaration for the entity "HTML.Version" must end with '>'.
org.xml.sax.SAXParseException: The declaration for the entity "HTML.Version" must end with '>'.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

       

[jira] [Commented] (PDFBOX-1213) Adding style information to the PDF to HTML converter

Posted by "Timo Boehme (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194619#comment-13194619 ] 

Timo Boehme commented on PDFBOX-1213:
-------------------------------------

I cannot see why the DOCTYPE declaration is a problem. Maybe something is wrong with your SAX parser configuration, e.g. it is trying to read the DTD? At the very least, whether the doctype is added should be made configurable.
To make XML processing easier afterwards, I would propose changing the HTML doctype to XHTML.
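
A minimal sketch of the parser-configuration route hinted at above, using only the standard JAXP API. It assumes the converter output is otherwise well-formed XML, and the class name DoctypeTolerantParse is hypothetical. The entity resolver answers every external entity request, including the one for loose.dtd, with an empty stream, so the DTD is never fetched or parsed and the DOCTYPE line can stay in the document.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper; not part of PDFBox. */
final class DoctypeTolerantParse {

    /** Parses markup that carries a DOCTYPE without loading the referenced DTD. */
    static Document parse(String markup) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Resolve every external entity (the DTD included) to an empty stream.
            builder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse converter output", e);
        }
    }

    public static void main(String[] args) {
        String markup =
                "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" "
                + "\"http://www.w3.org/TR/html4/loose.dtd\">\n"
                + "<html><head><title>t</title></head><body><p>hi</p></body></html>";
        System.out.println(parse(markup).getDocumentElement().getNodeName());
    }
}
```

One trade-off of this approach: with the DTD suppressed, named entities that it would have declared (for example &nbsp;) are undefined, so it only fits output that uses numeric character references or the five predefined XML entities.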
                

[jira] [Comment Edited] (PDFBOX-1213) Adding style information to the PDF to HTML converter

Posted by "Aaptha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504559#comment-13504559 ] 

Aaptha edited comment on PDFBOX-1213 at 11/27/12 11:59 AM:
-----------------------------------------------------------

The formatting information for italics is marked very inconsistently. Sometimes italic text is not marked properly, and sometimes the italic tag is never closed. This often happens when a line contains several italic words mixed with normal text.

What is the strategy for the subscripts and superscripts?

Is there any update on this issue? Is this part of 2.0.0?
                
      was (Author: aaptha):
    The formatting information for italics is marked very inconsistently. Sometimes italic text is not marked properly, and sometimes the italic tag is never closed. This often happens when a line contains several italic words mixed with normal text.

What is the strategy for the subscripts and superscripts?
                  

[jira] [Commented] (PDFBOX-1213) Adding style information to the PDF to HTML converter

Posted by "Enrique Pérez (Commented JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194622#comment-13194622 ] 

Enrique Pérez commented on PDFBOX-1213:
---------------------------------------

We've deleted the DOCTYPE declaration because we need to parse the document as an XML document in our application. Anyway, could you evaluate our style information contribution?
                

[jira] [Commented] (PDFBOX-1213) Adding style information to the PDF to HTML converter

Posted by "Timo Boehme (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194697#comment-13194697 ] 

Timo Boehme commented on PDFBOX-1213:
-------------------------------------

In my opinion the proposed changes to PDFTextStripper are tailored too closely to this one use case. I think we need a more general solution here, because more parameters can sometimes be extracted from the font definitions.

I would propose a fontChanged notification, perhaps using a listener pattern, so that when no listeners are registered we can skip the cycles spent extracting font information:

import java.util.LinkedList;
import java.util.List;

interface FontChangedListener {
    // called whenever the font of the extracted text changes
    void fontChanged( FontInformation fInfo );
}

// read-only view of the properties extracted from the font definition
interface FontInformation {
    boolean isBold();
    boolean isItalic();
    boolean isRoman();
    boolean isSansSerif();
    String getFontName();
    float getFontSizePt();
}

class PDFTextStripper {
...
   protected List<FontChangedListener> fontListeners = new LinkedList<FontChangedListener>();
...
   public void registerFontListener( FontChangedListener listener ) {
      fontListeners.add( listener );
   }

   protected void writePage() {
      ...
      // font information is only extracted when somebody is listening
      if ( ! fontListeners.isEmpty() ) {
         // test for font changes and notify listeners
      }
      ...
   }
}

In PDFText2HTML you have to keep track of whether a span was opened with font style information and close it before closing other tags.
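
The span bookkeeping described above can be sketched in isolation. The following is a self-contained illustration only: it does not use PDFBox, and FontStyle, FontChangeListener, SpanTrackingWriter and FontChangeDemo are hypothetical names that merely follow the shape of the proposal. The writer closes any open style span before opening a new one and again before closing the enclosing paragraph tag, so a line that mixes italic and normal words cannot leave a tag unclosed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical immutable font description; not a PDFBox class. */
final class FontStyle {
    final boolean bold;
    final boolean italic;
    final float sizePt;

    FontStyle(boolean bold, boolean italic, float sizePt) {
        this.bold = bold;
        this.italic = italic;
        this.sizePt = sizePt;
    }

    boolean sameAs(FontStyle other) {
        return other != null && bold == other.bold && italic == other.italic
                && Float.compare(sizePt, other.sizePt) == 0;
    }

    String toCss() {
        return "font-weight:" + (bold ? "bold" : "normal")
                + ";font-style:" + (italic ? "italic" : "normal")
                + ";font-size:" + sizePt + "pt";
    }
}

/** Hypothetical listener, modelled on the fontChanged idea above. */
interface FontChangeListener {
    void fontChanged(FontStyle newStyle);
}

/** Writes one paragraph and guarantees that every opened span is closed. */
final class SpanTrackingWriter implements FontChangeListener {
    private final StringBuilder out = new StringBuilder("<p>");
    private boolean spanOpen = false;

    @Override
    public void fontChanged(FontStyle newStyle) {
        closeSpanIfOpen(); // never nest or leak a style span
        out.append("<span style=\"").append(newStyle.toCss()).append("\">");
        spanOpen = true;
    }

    void writeText(String text) {
        out.append(text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;"));
    }

    String finish() {
        closeSpanIfOpen(); // close the span before the enclosing tag
        return out.append("</p>").toString();
    }

    private void closeSpanIfOpen() {
        if (spanOpen) {
            out.append("</span>");
            spanOpen = false;
        }
    }
}

final class FontChangeDemo {
    /** Renders text runs, notifying listeners only when the style changes. */
    static String render(String[] texts, FontStyle[] styles) {
        SpanTrackingWriter writer = new SpanTrackingWriter();
        List<FontChangeListener> listeners = new ArrayList<>();
        listeners.add(writer);

        FontStyle current = null;
        for (int i = 0; i < texts.length; i++) {
            // Skip the comparison entirely when nobody is listening.
            if (!listeners.isEmpty() && !styles[i].sameAs(current)) {
                current = styles[i];
                for (FontChangeListener listener : listeners) {
                    listener.fontChanged(current);
                }
            }
            writer.writeText(texts[i]);
        }
        return writer.finish();
    }

    public static void main(String[] args) {
        FontStyle normal = new FontStyle(false, false, 12f);
        FontStyle italic = new FontStyle(false, true, 12f);
        System.out.println(render(
                new String[] {"plain ", "slanted", " plain again"},
                new FontStyle[] {normal, italic, normal}));
    }
}
```

Keeping a single spanOpen flag in the writer, rather than emitting separate bold and italic tags, is a deliberate simplification: there is only ever one style element to close, whatever combination of properties changed.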

                

[jira] [Updated] (PDFBOX-1213) Adding style information to the PDF to HTML converter

Posted by "Enrique Pérez (Updated JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Enrique Pérez updated PDFBOX-1213:
----------------------------------

    Attachment: diff.patch
    

[jira] [Commented] (PDFBOX-1213) Adding style information to the PDF to HTML converter

Posted by "Aaptha (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13504559#comment-13504559 ] 

Aaptha commented on PDFBOX-1213:
--------------------------------

The formatting information for italics is marked very inconsistently. Sometimes italic text is not marked properly, and sometimes the italic tag is never closed. This often happens when a line contains several italic words mixed with normal text.

What is the strategy for the subscripts and superscripts?
                