Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/10/11 06:19:34 UTC

[jira] [Comment Edited] (PDFBOX-2246) PDFTextStripper should handle colors

    [ https://issues.apache.org/jira/browse/PDFBOX-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167992#comment-14167992 ] 

John Hewson edited comment on PDFBOX-2246 at 10/11/14 4:19 AM:
---------------------------------------------------------------

+1. As long as we don't merge runs of different colors, we can store the text color in a TextPosition as a PDColor, and in 2.0 the text mode can be taken into account in PDFTextStreamEngine#showGlyph() before the TextPosition is created. Also, in 2.0 the extra operators should be added in the constructor of PDFTextStreamEngine, as the .properties mechanism has been removed. Actually, given the differences, we might want to make this fix 2.0-only.
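
A minimal sketch of the "don't merge runs of different color" idea. It uses hypothetical stand-in types (Glyph, Run) rather than the real TextPosition or PDColor classes, and the color is assumed to be a plain array of components, so treat it as an illustration only:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ColorRunSketch {

    /** Hypothetical stand-in for a text position that carries its color (not PDFBox API). */
    static final class Glyph {
        final String text;
        final float[] rgb; // assumed: non-stroking color components

        Glyph(String text, float... rgb) {
            this.text = text;
            this.rgb = rgb;
        }
    }

    /** A run of adjacent glyphs that all share one color. */
    static final class Run {
        final StringBuilder text = new StringBuilder();
        final float[] rgb;

        Run(float[] rgb) {
            this.rgb = rgb;
        }
    }

    /** Groups adjacent glyphs into runs, starting a new run whenever the color changes. */
    static List<Run> groupByColor(List<Glyph> glyphs) {
        List<Run> runs = new ArrayList<>();
        Run current = null;
        for (Glyph g : glyphs) {
            if (current == null || !Arrays.equals(current.rgb, g.rgb)) {
                current = new Run(g.rgb);
                runs.add(current);
            }
            current.text.append(g.text);
        }
        return runs;
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = Arrays.asList(
                new Glyph("H", 0f, 0f, 0f),
                new Glyph("i", 0f, 0f, 0f),
                new Glyph("!", 1f, 0f, 0f));
        for (Run r : groupByColor(glyphs)) {
            System.out.println(r.text + " " + Arrays.toString(r.rgb));
        }
    }
}
```

Because a run is only extended while the color stays the same, each run can carry exactly one color value.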


was (Author: jahewson):
+1. As long as we don't merge runs of different colors, we can store the text color in a TextPosition, and in 2.0 the text mode can be taken into account in PDFTextStreamEngine#showGlyph() before the TextPosition is created. Also, in 2.0 the extra operators should be added in the constructor of PDFTextStreamEngine, as the .properties mechanism has been removed. Actually, given the differences, we might want to make this fix 2.0-only.

> PDFTextStripper should handle colors
> ------------------------------------
>
>                 Key: PDFBOX-2246
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2246
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>    Affects Versions: 1.8.6, 1.8.7, 2.0.0
>            Reporter: Tilman Hausherr
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> A recent thread on the dev mailing list (with Aaron H.) dealt with the inability to extract color with PDFTextStripper. The solution was to create a PDFTextStripper with these entries in the properties file:
> {code}
> CS=org.apache.pdfbox.util.operator.SetStrokingColorSpace
> cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace
> rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor
> G=org.apache.pdfbox.util.operator.SetStrokingGrayColor
> g=org.apache.pdfbox.util.operator.SetNonStrokingGrayColor
> K=org.apache.pdfbox.util.operator.SetStrokingCMYKColor
> k=org.apache.pdfbox.util.operator.SetNonStrokingCMYKColor
> RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor
> SC=org.apache.pdfbox.util.operator.SetStrokingColor
> sc=org.apache.pdfbox.util.operator.SetNonStrokingColor
> SCN=org.apache.pdfbox.util.operator.SetStrokingColor
> scn=org.apache.pdfbox.util.operator.SetNonStrokingColor
> {code}
> I therefore propose (and I'd like to get at least one "+1" before starting because I've never worked on that segment before):
> - replacing the empty entries in the PDFTextStripper property file with the ones above
> - improving the printtextlocations example
> The problem has come up before: PDFBOX-1736, http://stackoverflow.com/q/10844271/535646 , http://stackoverflow.com/a/9157714/535646 and the solutions presented are rather cumbersome (using a PageDrawer object).
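
The operator entries quoted above use ordinary Java .properties syntax, where keys are case-sensitive (so rg and RG are distinct operators) and a repeated key simply overwrites the earlier one. A small sketch using only java.util.Properties and no PDFBox classes; the class name OperatorMappingSketch and the shortened entry list are made up for illustration:

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class OperatorMappingSketch {

    // A shortened copy of the operator-to-class entries, for illustration only.
    static final String ENTRIES = String.join("\n",
            "cs=org.apache.pdfbox.util.operator.SetNonStrokingColorSpace",
            "rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor",
            "RG=org.apache.pdfbox.util.operator.SetStrokingRGBColor",
            "rg=org.apache.pdfbox.util.operator.SetNonStrokingRGBColor");

    /** Parses .properties text into an operator-name to class-name mapping. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = parse(ENTRIES);
        // Iteration order of Properties is not guaranteed.
        for (String operator : props.stringPropertyNames()) {
            System.out.println(operator + " -> " + props.getProperty(operator));
        }
    }
}
```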



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)