Posted to dev@pdfbox.apache.org by "Dmitry Gutso (JIRA)" <ji...@apache.org> on 2009/08/29 11:25:32 UTC
[jira] Created: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Lost spacing as a result of operator "Tc" ignoring.
---------------------------------------------------
Key: PDFBOX-508
URL: https://issues.apache.org/jira/browse/PDFBOX-508
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 0.7.3
Environment: JDK 1.6.0_16
Reporter: Dmitry Gutso
This continues https://issues.apache.org/jira/browse/PDFBOX-234
Spacing is lost because the "Tc" (character spacing) operator is ignored.
Example:
****************************************
BT
6 0 0 6 244.0800018311 795.8400268555 Tm
6.5475001335 Tc
(41) Tj
****************************************
Here PDFTextStripper.writeText() returns "41" (without the spacing).
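To give a feel for how much space the Tc operand adds in the example above, here is a minimal sketch of the glyph displacement formula from the PDF specification, tx = (w0 * Tfs + Tc + Tw) * Th, with the TJ adjustment left out. The class and method names are hypothetical and this is not PDFBox code. The glyph width and font size are assumed values (the excerpt shows no Tf operator), and the sketch assumes an identity CTM and 100% horizontal scaling.

```java
public class TcDisplacement {

    /**
     * Horizontal glyph displacement in unscaled text space units:
     * tx = (w0 * Tfs + Tc + Tw) * Th  (TJ adjustment omitted).
     */
    public static double displacement(double w0, double tfs, double tc, double tw, double th) {
        return (w0 * tfs + tc + tw) * th;
    }

    /** Extra advance in user space caused by Tc alone, given the horizontal scale of Tm. */
    public static double extraGap(double tc, double th, double tmScaleX) {
        double w0 = 0.5;   // assumed glyph width (500/1000); cancels out in the difference
        double tfs = 1.0;  // assumed font size; cancels out in the difference
        double withTc = displacement(w0, tfs, tc, 0.0, th);
        double withoutTc = displacement(w0, tfs, 0.0, 0.0, th);
        return (withTc - withoutTc) * tmScaleX;
    }

    public static void main(String[] args) {
        // Tc = 6.5475001335 and the Tm scale of 6 are taken from the content stream above.
        double gap = extraGap(6.5475001335, 1.0, 6.0);
        System.out.printf("Tc adds about %.2f units between '4' and '1'%n", gap);
    }
}
```

Under these assumptions the extra advance is 6.5475001335 * 6, roughly 39.3 user space units, which is far wider than a typical space glyph and explains why the two digits render as separate columns.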
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751118#action_12751118 ]
Dmitry Gutso commented on PDFBOX-508:
-------------------------------------
Here is my proposed fix:
PDFStreamEngine_For_Spacing.diff
TextPosition_for_Spacing.diff
I would be grateful if somebody could test it.
[jira] Updated: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitry Gutso updated PDFBOX-508:
--------------------------------
Attachment: 2a_repl2.pdf
2a.pdf
File 2a_repl2.pdf is a modified version of 2a.pdf.
[jira] Updated: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitry Gutso updated PDFBOX-508:
--------------------------------
Comment: was deleted
(was: It may be right to insert the following in "PDFStreamEngine.java", before:
totalCharCnt += c.length();
stringResult.append( c );
something similar to:
if(c != null && spaceWidthText < spacingText) { c = c + " "; }
Otherwise, table columns that are created by spacingText (Tc) are lost.
)
[jira] Commented: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750416#action_12750416 ]
Dmitry Gutso commented on PDFBOX-508:
-------------------------------------
The cause of the error is in the method processEncodedText of class PDFStreamEngine.
[jira] Commented: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749338#action_12749338 ]
Dmitry Gutso commented on PDFBOX-508:
-------------------------------------
It may be right to insert the following in "PDFStreamEngine.java", before:
totalCharCnt += c.length();
stringResult.append( c );
something similar to:
if(c != null && spaceWidthText < spacingText) { c = c + " "; }
Otherwise, table columns that are created by spacingText (Tc) are lost.
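The suggestion above can be sketched in isolation as follows. This is only a standalone illustration of the heuristic, not the actual PDFStreamEngine code: the class SpacingHeuristic and its methods are hypothetical, while the names c, spaceWidthText, spacingText, totalCharCnt and stringResult are borrowed from the comment. The space width of 0.25 used in main is an assumed value.

```java
public class SpacingHeuristic {
    private final StringBuilder stringResult = new StringBuilder();
    private int totalCharCnt = 0;

    /**
     * Appends one decoded glyph string. A space is added after it when the
     * character spacing (Tc) is wider than the width of a space glyph.
     */
    public void append(String c, float spaceWidthText, float spacingText) {
        if (c == null) {
            return; // nothing was decoded for this glyph
        }
        if (spaceWidthText < spacingText) {
            c = c + " ";
        }
        totalCharCnt += c.length();
        stringResult.append(c);
    }

    public String text() {
        return stringResult.toString();
    }

    public int count() {
        return totalCharCnt;
    }

    public static void main(String[] args) {
        SpacingHeuristic h = new SpacingHeuristic();
        // Tc from the reported content stream; 0.25 is an assumed space width.
        h.append("4", 0.25f, 6.5475f);
        h.append("1", 0.25f, 6.5475f);
        System.out.println("[" + h.text() + "]");
    }
}
```

Note that this simple form also appends a space after the last glyph of a string, and it compares the two values without regard to the text matrix scale, so it is likely to need refinement before it behaves well on other documents.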
[jira] Issue Comment Edited: (PDFBOX-508) Lost spacing as a result
of operator "Tc" ignoring.
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792525#action_12792525 ]
Andreas Lehmkühler edited comment on PDFBOX-508 at 12/18/09 4:48 PM:
---------------------------------------------------------------------
Works fine after resolving PDFBOX-520 and PDFBOX-571.
was (Author: lehmi):
Works fine after resolving PDFBOX-571.
[jira] Commented: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12751920#action_12751920 ]
Andreas Lehmkühler commented on PDFBOX-508:
-------------------------------------------
First of all thanks for the contribution.
I ran some tests and it worked with your sample, but there are some unwanted side effects with other documents. I guess we have to do some more testing, as your patch affects a really fragile part of the text extraction code of PDFBox.
[jira] Updated: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitry Gutso updated PDFBOX-508:
--------------------------------
Affects Version/s: (was: 0.7.3)
0.8.0-incubator
[jira] Resolved: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andreas Lehmkühler resolved PDFBOX-508.
---------------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Works fine after resolving PDFBOX-571.
[jira] Updated: (PDFBOX-508) Lost spacing as a result of operator
"Tc" ignoring.
Posted by "Dmitry Gutso (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/PDFBOX-508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dmitry Gutso updated PDFBOX-508:
--------------------------------
Attachment: TextPosition_for_Spacing.diff
PDFStreamEngine_For_Spacing.diff