Posted to dev@pdfbox.apache.org by "Justin LeFebvre (JIRA)" <ji...@apache.org> on 2009/02/24 20:59:02 UTC

[jira] Created: (PDFBOX-434) Improve html output

Improve html output
-------------------

                 Key: PDFBOX-434
                 URL: https://issues.apache.org/jira/browse/PDFBOX-434
             Project: PDFBox
          Issue Type: Improvement
          Components: Text extraction
            Reporter: Justin LeFebvre


Would like to improve the HTML output of PDF files for Arabic rendering.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PDFBOX-434) Improve html output

Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin LeFebvre updated PDFBOX-434:
-----------------------------------

    Attachment: html_improvements.diff

> Improve html output
> -------------------
>
>                 Key: PDFBOX-434
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-434
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Justin LeFebvre
>         Attachments: html_improvements.diff
>
>
> Would like to improve the HTML output of PDF files for Arabic rendering. The attached file has changes that should improve the way the -html option works. Output files now get the .html extension. We also added a DOCTYPE declaration and a <meta> tag that declares the encoding of the file. We removed a lot of unused code from PDFTextStripper and PDFText2HTML. We added the ability to set the <title> tag of the HTML document to the title given in the PDF document information, if it exists; otherwise a title is guessed from the first lines of the file.
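
For readers following the description above, here is a minimal, self-contained sketch of the header logic it describes: prefer the title from the document information, fall back to the first non-blank line of the extracted text, and emit a DOCTYPE plus a <meta> tag naming the encoding. This is not the code from html_improvements.diff. The class and method names (HtmlHeaderSketch, chooseTitle, buildHeader), the HTML 4.01 Transitional DOCTYPE, and the 80-character title cap are all assumptions made for illustration.

```java
public class HtmlHeaderSketch {
    // Arbitrary cap on a guessed title; an assumption, not taken from the patch.
    private static final int MAX_TITLE_LENGTH = 80;

    /** Escapes the characters that are unsafe inside HTML text and attributes. */
    public static String escape(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    /**
     * Returns the document-information title if present; otherwise guesses
     * one from the first non-blank line of the extracted text (non-null).
     */
    public static String chooseTitle(String infoTitle, String extractedText) {
        if (infoTitle != null && !infoTitle.trim().isEmpty()) {
            return infoTitle.trim();
        }
        for (String line : extractedText.split("\r?\n")) {
            String candidate = line.trim();
            if (!candidate.isEmpty()) {
                return candidate.length() > MAX_TITLE_LENGTH
                        ? candidate.substring(0, MAX_TITLE_LENGTH)
                        : candidate;
            }
        }
        return "";
    }

    /** Builds the DOCTYPE, <title>, and encoding <meta> tag for the output. */
    public static String buildHeader(String title, String encoding) {
        return "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\"\n"
             + "\"http://www.w3.org/TR/html4/loose.dtd\">\n"
             + "<html><head>\n"
             + "<title>" + escape(title) + "</title>\n"
             + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset="
             + escape(encoding) + "\">\n"
             + "</head>\n";
    }

    public static void main(String[] args) {
        // No title in the document information, so one is guessed from the text.
        String title = chooseTitle(null, "\n  Annual Report 2008\nSecond line");
        System.out.print(buildHeader(title, "UTF-8"));
    }
}
```

Declaring the encoding in the header matters most for non-Latin output such as Arabic, since a browser that guesses the wrong charset will show garbled text.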

[jira] Resolved: (PDFBOX-434) Improve html output

Posted by "Brian Carrier (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brian Carrier resolved PDFBOX-434.
----------------------------------

    Resolution: Fixed

Checked in a slight variation of this patch. The original patch would have failed for console output.

Note that this patch renames the PDFTextStripper.beginParagraph() and PDFTextStripper.endParagraph() methods to PDFTextStripper.beginArticle() and PDFTextStripper.endArticle(), which are more accurate names. PDFBox currently has no way to detect paragraph boundaries; these methods are called at the beginning and end of each column on each page.
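
To illustrate the hook pattern described in the note above, here is a small stand-alone sketch. It does not use PDFBox: PlainStripper and HtmlStripper are hypothetical stand-ins, and wrapping each column in a <div> is an illustrative choice rather than what the committed code necessarily does. The only thing borrowed from the note is the idea that beginArticle() and endArticle() fire once per column, not once per paragraph.

```java
import java.util.Arrays;
import java.util.List;

public class ArticleHooksSketch {

    /** Hypothetical stand-in for a text stripper; not the PDFBox class. */
    static class PlainStripper {
        protected final StringBuilder out = new StringBuilder();

        // Hooks invoked around each column; plain-text output needs no markup.
        protected void beginArticle() { }
        protected void endArticle() { }

        protected void writeLine(String line) {
            out.append(line).append('\n');
        }

        /** Each inner list holds the lines of one column on some page. */
        public String render(List<List<String>> columns) {
            for (List<String> column : columns) {
                beginArticle();
                for (String line : column) {
                    writeLine(line);
                }
                endArticle();
            }
            return out.toString();
        }
    }

    /** HTML variant: overrides the hooks to wrap every column in a block. */
    static class HtmlStripper extends PlainStripper {
        @Override protected void beginArticle() { out.append("<div>"); }
        @Override protected void endArticle() { out.append("</div>\n"); }

        // No HTML escaping here, to keep the sketch short.
        @Override protected void writeLine(String line) {
            out.append(line).append("<br>\n");
        }
    }

    public static void main(String[] args) {
        List<List<String>> columns = Arrays.asList(
                Arrays.asList("left column, line 1", "left column, line 2"),
                Arrays.asList("right column, line 1"));
        System.out.print(new HtmlStripper().render(columns));
    }
}
```

Because the hooks bracket whole columns, a subclass cannot use them to emit one <p> per paragraph, which is why the paragraph-based names were misleading.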

Sending        trunk/src/main/java/org/apache/pdfbox/ExtractText.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFText2HTML.java
Sending        trunk/src/main/java/org/apache/pdfbox/util/PDFTextStripper.java
Transmitting file data ...
Committed revision 747858.

[jira] Updated: (PDFBOX-434) Improve html output

Posted by "Justin LeFebvre (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin LeFebvre updated PDFBOX-434:
-----------------------------------

    Description: Would like to improve the HTML output of PDF files for Arabic rendering. The attached file has changes that should improve the way the -html option works. Output files now get the .html extension. We also added a DOCTYPE declaration and a <meta> tag that declares the encoding of the file. We removed a lot of unused code from PDFTextStripper and PDFText2HTML. We added the ability to set the <title> tag of the HTML document to the title given in the PDF document information, if it exists; otherwise a title is guessed from the first lines of the file.  (was: Would like to improve the html output of pdf files for arabic rendering. )
