Posted to dev@pdfbox.apache.org by "Dmitry Gutso (JIRA)" <ji...@apache.org> on 2009/08/29 11:25:32 UTC

[jira] Created: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Lost spacing as a result of operator "Tc" ignoring.
---------------------------------------------------

                 Key: PDFBOX-508
                 URL: https://issues.apache.org/jira/browse/PDFBOX-508
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.7.3
         Environment: JDK 1.6.0_16
            Reporter: Dmitry Gutso


Continuation of https://issues.apache.org/jira/browse/PDFBOX-234

Spacing is lost because the "Tc" (character spacing) operator is ignored.
Example:
****************************************
BT
 6 0 0 6 244.0800018311 795.8400268555 Tm
 6.5475001335 Tc
 (41) Tj
****************************************
Here PDFTextStripper.writeText() returns "41" (without the spacing between the characters).
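
To make the size of the lost gap concrete, here is a rough sketch of the glyph displacement formula from the PDF specification, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, applied to the values in the example. The Tc value and the Tm scale of 6 come from the content stream above; the glyph width w0 and the font size Tfs are not shown in the report, so the values used here are assumptions for illustration only. The class and method names are hypothetical and are not part of PDFBox.

```java
public class TcAdvanceSketch {

    /**
     * Horizontal displacement of one glyph in unscaled text space:
     * tx = ((w0 - tj / 1000) * tfs + tc + tw) * th
     * (tw only applies to the single-byte space character, code 32).
     */
    static double textSpaceAdvance(double w0, double tj, double tfs,
                                   double tc, double tw, double th) {
        return ((w0 - tj / 1000.0) * tfs + tc + tw) * th;
    }

    public static void main(String[] args) {
        double tc = 6.5475001335; // from "6.5475001335 Tc" in the report
        double tmScaleX = 6.0;    // from "6 0 0 6 ... Tm" in the report
        double w0 = 0.5;          // ASSUMPTION: glyph width of a digit, not given in the report
        double tfs = 1.0;         // ASSUMPTION: font size, the Tf operator is not shown

        // Advance after the "4", scaled by the text matrix, with and without Tc.
        double withTc = textSpaceAdvance(w0, 0, tfs, tc, 0, 1) * tmScaleX;
        double withoutTc = textSpaceAdvance(w0, 0, tfs, 0, 0, 1) * tmScaleX;

        System.out.println("advance with Tc:    " + withTc);
        System.out.println("advance without Tc: " + withoutTc);
    }
}
```

Under these assumptions the advance with Tc is many times larger than the glyph itself, which is why the "4" and the "1" appear as separate columns on the page but are extracted as "41".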

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751118#action_12751118 ] 

Dmitry Gutso commented on PDFBOX-508:
-------------------------------------

Here is my proposed fix:
PDFStreamEngine_For_Spacing.diff
TextPosition_for_Spacing.diff
I would be grateful if somebody could test it.



[jira] Updated: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Gutso updated PDFBOX-508:
--------------------------------

    Attachment: 2a_repl2.pdf
                2a.pdf

The file 2a_repl2.pdf is a modified version of 2a.pdf.



[jira] Updated: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Gutso updated PDFBOX-508:
--------------------------------

    Comment: was deleted

(was: It might be right to insert into "PDFStreamEngine.java", before:

            totalCharCnt += c.length();
            
            stringResult.append( c );


something similar to:

            if( c != null && spaceWidthText < spacingText )
            {
                c = c + " ";
            }

Otherwise the table columns created by spacingText (Tc) are lost.
)



[jira] Commented: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750416#action_12750416 ] 

Dmitry Gutso commented on PDFBOX-508:
-------------------------------------

The cause of the error is in the method processEncodedText of the class PDFStreamEngine.



[jira] Commented: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749338#action_12749338 ] 

Dmitry Gutso commented on PDFBOX-508:
-------------------------------------

It might be right to insert into "PDFStreamEngine.java", before:

            totalCharCnt += c.length();
            
            stringResult.append( c );


something similar to:

            if( c != null && spaceWidthText < spacingText )
            {
                c = c + " ";
            }

Otherwise the table columns created by spacingText (Tc) are lost.
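
The suggested check can be tried in isolation with a small standalone sketch like the one below. It is not the attached patch and not the actual PDFStreamEngine code: the class and method names are hypothetical, and only the variable names c, spaceWidthText and spacingText are taken from the snippet above. It assumes both values are expressed in the same units.

```java
public class SpacingHeuristicSketch {

    /**
     * Appends a space to the decoded character when the character spacing (Tc)
     * is wider than the width of a space glyph, so that a gap produced purely
     * by Tc still shows up in the extracted text.
     */
    static String appendSpaceIfWide(String c, float spaceWidthText, float spacingText) {
        if (c != null && spaceWidthText < spacingText) {
            return c + " ";
        }
        return c;
    }

    public static void main(String[] args) {
        // ASSUMPTION: 0.25f is an illustrative space width, not a value from the report.
        StringBuilder stringResult = new StringBuilder();
        for (String c : new String[] { "4", "1" }) {
            stringResult.append(appendSpaceIfWide(c, 0.25f, 6.5475f));
        }
        System.out.println("[" + stringResult + "]");
    }
}
```

Note that this simple form also appends a space after the last character of a string, which may be one source of unwanted extra whitespace in other documents.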




[jira] Issue Comment Edited: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792525#action_12792525 ] 

Andreas Lehmkühler edited comment on PDFBOX-508 at 12/18/09 4:48 PM:
---------------------------------------------------------------------

Works fine after resolving PDFBOX-520 and PDFBOX-571.

      was (Author: lehmi):
    Works fine after resolving PDFBOX-571.
  


[jira] Commented: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751920#action_12751920 ] 

Andreas Lehmkühler commented on PDFBOX-508:
-------------------------------------------

First of all, thanks for the contribution.

I've run some tests and the patch works with your sample, but there are some unwanted side effects with other documents. I guess we have to do some more testing, as your patch affects a really fragile part of the text extraction code in PDFBox.




[jira] Updated: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Gutso updated PDFBOX-508:
--------------------------------

    Affects Version/s:     (was: 0.7.3)
                       0.8.0-incubator



[jira] Resolved: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-508.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

Works fine after resolving PDFBOX-571.



[jira] Updated: (PDFBOX-508) Lost spacing as a result of operator "Tc" ignoring.

Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitry Gutso updated PDFBOX-508:
--------------------------------

    Attachment: TextPosition_for_Spacing.diff
                PDFStreamEngine_For_Spacing.diff
