Posted to users@pdfbox.apache.org by "C. Alexander Leigh" <a...@a6v.org> on 2016/06/15 17:10:31 UTC

Suppressing layers on output

I need to suppress layers when rendering, because I am working with GeoPDFs
(maps) that contain many layers, and at any given time the end-user is only
interested in viewing a subset of them. Several people have asked about
this in the past, but I never saw an implementation, so I thought I would
post what I did in the hope that it brings others good fortune.

Note: I do not claim this is a complete implementation. I am a novice at
the PDF specification, but it did work for me with commercially produced
PDFs. Is it right? I have no idea. It works. I implemented everything
outside of pdfbox, with the exception that I had to change
processStreamOperators() on PDFStreamEngine to public so that I could
override it.

All code herein is released by me, the author, into the public domain. Go
forth.

The BDC and EMC operators need to be implemented. They track which layer
the enclosed content belongs to. It turns out these sections can nest, and
not every BDC contains layer information, but each one still has a matching
EMC, so you have to account for all of them.
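To illustrate the bookkeeping described above, here is a minimal sketch. It is a toy model, not PDFBox code: the operator strings, the class name MarkedContentStackDemo and the method visibleOps are made up for the example, and a real content stream is tokenized by the library rather than split on spaces. It only shows why every BDC needs a stack entry, even one that carries no layer information.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy model of nested marked-content tracking; no PDFBox types involved. */
final class MarkedContentStackDemo {

    /**
     * Walks a simplified operator list and returns the drawing operators
     * that would still be rendered. "/OC /MC0 BDC" opens a layer section,
     * any other "... BDC" opens a section without layer information, and
     * "EMC" closes the most recently opened section.
     */
    static List<String> visibleOps(List<String> ops, Set<String> hidden) {
        List<String> stack = new ArrayList<>();
        List<String> visible = new ArrayList<>();
        for (String op : ops) {
            if (op.endsWith("BDC")) {
                String[] parts = op.split(" ");
                boolean isLayer = parts.length == 3 && parts[0].equals("/OC");
                // Push a null placeholder when there is no layer ID, so the
                // matching EMC still pops exactly one entry.
                stack.add(isLayer ? parts[1].substring(1) : null);
            } else if (op.equals("EMC")) {
                if (!stack.isEmpty()) {
                    stack.remove(stack.size() - 1);
                }
            } else if (stack.stream().noneMatch(id -> id != null && hidden.contains(id))) {
                visible.add(op);
            }
        }
        return visible;
    }

    public static void main(String[] args) {
        List<String> ops = List.of(
                "/OC /MC0 BDC", "draw-roads",
                "/Span <<>> BDC", "draw-label", "EMC",
                "EMC",
                "/OC /MC1 BDC", "draw-ortho", "EMC",
                "draw-frame");
        System.out.println(visibleOps(ops, Set.of("MC1")));
    }
}
```

With MC1 hidden this prints [draw-roads, draw-label, draw-frame]. Hiding MC0 instead would also drop draw-label, because it sits inside the MC0 section.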

=====

package org.a6v.pdf;

import org.apache.pdfbox.contentstream.operator.Operator;
import
org.apache.pdfbox.contentstream.operator.graphics.GraphicsOperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.List;

public class BDCOperatorProcessor extends GraphicsOperatorProcessor {
    private static final Logger logger =
LoggerFactory.getLogger(BDCOperatorProcessor.class);
    private final FilteringPageDrawer drawer;


    public BDCOperatorProcessor(FilteringPageDrawer drawer) {
        this.drawer = drawer;
    }

    /**
     * Process the operator.
     *
     * @param operator the operator to process
     * @param operands the operands to use when processing
     * @throws IOException if the operator cannot be processed
     */
    @Override
    public void process(Operator operator, List<COSBase> operands) throws
IOException {
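        // BDC takes two operands: a tag and the properties. For a layer the
        // properties operand is a resource name such as /MC0. Every other
        // BDC still gets a null placeholder so that its EMC pops correctly.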
        if (operands.size() < 2) {
            logger.debug("Operands list was short");
            drawer.getBdcStack().add(null);
            return;
        }

        COSBase name = operands.get(1);

        if (!(name instanceof COSName)) {
            logger.debug("Name is not a COSName: {}", name.getClass());
            drawer.getBdcStack().add(null);
            return;
        }

        String n = ((COSName) name).getName();

        if (n == null) {
            logger.debug("Layer name was null");
            drawer.getBdcStack().add(null);
            return;
        }

        logger.debug("Determined layer name: {}", n);

        drawer.getBdcStack().add(n);
    }

    /**
     * Returns the name of this operator, i.e. "BDC".
     */
    @Override
    public String getName() {
        return "BDC";
    }
}

===

package org.a6v.pdf;

import org.apache.pdfbox.contentstream.operator.Operator;
import
org.apache.pdfbox.contentstream.operator.graphics.GraphicsOperatorProcessor;
import org.apache.pdfbox.cos.COSBase;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.List;

public class EMCOperatorProcessor extends GraphicsOperatorProcessor {
    private static final Logger logger =
LoggerFactory.getLogger(EMCOperatorProcessor.class);
    private final FilteringPageDrawer drawer;


    public EMCOperatorProcessor(FilteringPageDrawer drawer) {
        logger.debug("Created");
        this.drawer = drawer;
    }

    /**
     * Process the operator.
     *
     * @param operator the operator to process
     * @param operands the operands to use when processing
     * @throws IOException if the operator cannot be processed
     */
    @Override
    public void process(Operator operator, List<COSBase> operands) throws
IOException {
        logger.debug("called: {} {}", operator, operands);

        List<String> stack = drawer.getBdcStack();

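        // Guard against an EMC with nothing left to pop, for example one
        // closing a BMC section, which is not pushed onto this stack.
        if (!stack.isEmpty()) {
            stack.remove(stack.size() - 1);
        }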

        logger.debug("Current content stack: {}", stack);
    }

    /**
     * Returns the name of this operator, i.e. "EMC".
     */
    @Override
    public String getName() {
        return "EMC";
    }
}

===

Once this is done, only two classes need to be extended, the PDFRenderer
and the PageDrawer.

===

package org.a6v.pdf;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.rendering.PDFRenderer;
import org.apache.pdfbox.rendering.PageDrawer;
import org.apache.pdfbox.rendering.PageDrawerParameters;

import java.io.IOException;
import java.util.HashSet;

/**
 * This implementation of <code>PDFRenderer</code> is capable of suppressing
 * layers on the render.
 *
 * @author C. Alexander Leigh
 */
public class FilteringRenderer extends PDFRenderer {
    private final HashSet<String> hiddenList;

    /**
     * Creates a new FilteringRenderer.
     *
     * @param document   the document to render
     * @param hiddenList IDs of the layers to suppress, e.g. "MC2"
     */
    public FilteringRenderer(PDDocument document, HashSet<String>
hiddenList) {
        super(document);
        this.hiddenList = hiddenList;
    }

    @Override
    protected PageDrawer createPageDrawer(PageDrawerParameters parameters)
            throws IOException {
        return new FilteringPageDrawer(parameters, hiddenList);
    }
}

===

The page drawer implementation is the real meat of it. Here we have to
track whether or not we are rendering the current layer. Note that we
suppress children if a parent is being hidden. While we are suppressing, we
simply don't process the content on the stream, unless it happens to be a
BDC or an EMC. Once we are no longer suppressing, everything goes back to
normal.

===

package org.a6v.pdf;

import org.apache.pdfbox.contentstream.PDContentStream;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.rendering.PageDrawer;
import org.apache.pdfbox.rendering.PageDrawerParameters;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

public class FilteringPageDrawer extends PageDrawer {
    private static final Logger logger =
LoggerFactory.getLogger(FilteringPageDrawer.class);
    private final ArrayList<String> bdcStack = new ArrayList<>();
    private final HashSet<String> hiddenList;

    /**
     * Constructor.
     *
     * @param parameters Parameters for page drawing.
     * @param hiddenList IDs of the layers to suppress.
     * @throws IOException If there is an error loading properties from the
     *                     file.
     */
    public FilteringPageDrawer(PageDrawerParameters parameters,
HashSet<String> hiddenList) throws IOException {
        super(parameters);
        this.hiddenList = hiddenList;
        addOperator(new EMCOperatorProcessor(this));
        addOperator(new BDCOperatorProcessor(this));

        logger.info("Created: {}", parameters);
    }

    /**
     * Returns <code>true</code> if the renderer is currently rendering,
     * otherwise returns <code>false</code>. If <code>false</code> is
     * returned, then anything in the PDF stream should not be rendered to
     * the output.
     *
     * Rendering is suppressed if any of the currently open optional content
     * tags is contained in the suppression list. As a side effect, the
     * children of a suppressed optional content section are suppressed too.
     */
    public boolean isRendering() {
        for (String idx : bdcStack) {
            if (hiddenList.contains(idx)) {
                logger.debug("Filtering...");
                return false;
            }
        }

        return true;
    }

    /**
     * Returns the current BDC stack for this drawer. Each time a BDC is
     * processed, the name of the content is pushed onto this stack. When an
     * EMC is processed, the corresponding entry is pulled off. This is
     * coordinated by the <code>BDCOperatorProcessor</code> and
     * <code>EMCOperatorProcessor</code> classes.
     *
     * @return the stack of open marked-content entries, innermost last;
     *         entries without layer information are <code>null</code>
     */
    public List<String> getBdcStack() {
        return bdcStack;
    }

    /**
     * Processes the operators of the given content stream.
     *
     * @param contentStream the content stream to parse.
     * @throws IOException if there is an error reading or parsing the
     *                     content stream.
     */
    public void processStreamOperators(PDContentStream contentStream)
throws IOException {
        logger.debug("Called");

        List<COSBase> arguments = new ArrayList<COSBase>();
        PDFStreamParser parser = new PDFStreamParser(contentStream);
        Object token = parser.parseNextToken();
        while (token != null) {

            if (isRendering()) {
                if (token instanceof COSObject) {
                    arguments.add(((COSObject) token).getObject());
                } else if (token instanceof Operator) {
                    processOperator((Operator) token, arguments);
                    arguments = new ArrayList<>();
                } else {
                    arguments.add((COSBase) token);
                }
            } else {
                // If we are not currently rendering, we only process EMC
                // and BDC
                if (token instanceof Operator) {
                    String tokenName = ((Operator) token).getName();
                    if (tokenName.equals("BDC") || tokenName.equals("EMC"))
{
                        processOperator((Operator) token, arguments);
                        arguments = new ArrayList<>();
                    }
                }
            }

            token = parser.parseNextToken();
        }
    }
}

===

Note that the hidden list and most of this code work on the IDs of the
layers, not their names. I actually have no idea where the IDs come from,
whether PDFBox comes up with them or whether they are in the PDF.
Regardless, you have to build a translation from the layer description to
the ID for any of this to be useful. The IDs in my example ran MC0, MC1,
and so on. Note that the index order is not the same as the order in which
the optional content groups enumerate. I tried that first, ha ha. You
actually have to build the index. If yours happen to line up it is just a
coincidence.

The layers can be enumerated like this:

        PDPage zero = doc.getPage(idx);
        PDResources res = zero.getResources();

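        for (COSName propName : res.getPropertiesNames()) {
            // Other kinds of property list can be found here, so check the
            // type before casting.
            PDPropertyList prop = res.getProperties(propName);
            if (prop instanceof PDOptionalContentGroup) {
                PDOptionalContentGroup mc = (PDOptionalContentGroup) prop;
                logger.info("Prop: {} {}", propName, mc.getName());
            }
        }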

So for example if you see that you want to hide MC2, put it in a HashSet
and pass that into a new FilteringRenderer. That looks something like this:

        HashSet<String> hidden = new HashSet<>();
        // Suppress the ortho
        hidden.add("MC2");

        FilteringRenderer renderer = new FilteringRenderer(doc, hidden);
        BufferedImage map = renderer.renderImageWithDPI(0, dpi);
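
Because the hidden set is keyed by ID while people think in layer names, a small lookup step may help. The following is only a sketch, assuming you have already collected an ID-to-name table from the enumeration loop above. The class LayerIdLookup, the method idsForNames and the layer names are hypothetical and not part of PDFBox.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Translates human-readable layer names into the IDs used by the hidden set. */
final class LayerIdLookup {

    /**
     * Returns the IDs of every layer whose display name is in namesToHide.
     * Several IDs may share one name, so all matches are collected.
     */
    static Set<String> idsForNames(Map<String, String> idToName, Set<String> namesToHide) {
        Set<String> ids = new HashSet<>();
        for (Map.Entry<String, String> entry : idToName.entrySet()) {
            String name = entry.getValue();
            if (name != null && namesToHide.contains(name)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Hypothetical table, as it might be logged by the enumeration loop.
        Map<String, String> idToName = new LinkedHashMap<>();
        idToName.put("MC0", "Map Frame");
        idToName.put("MC1", "Roads");
        idToName.put("MC2", "Orthoimage");

        Set<String> hidden = idsForNames(idToName, Set.of("Orthoimage"));
        System.out.println(hidden);
    }
}
```

The resulting set can be wrapped in a HashSet and handed to the FilteringRenderer constructor in place of the hand-written "MC2" entry.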

I sincerely hope this helps someone facing this same challenge.

Thanks for listening!

-- 
C. Alexander Leigh

Re: Suppressing layers on output

Posted by "C. Alexander Leigh" <a...@a6v.org>.
Tilman,

I don't mind doing a little work for you to integrate it into the trunk but
I have some reservations; largely, I am not sure that I implemented it
properly with regards to the PDF spec. I know it works for the GeoPDFs I
have been working on (and unrelated to the layers I have a pile of code
that sifts out the GeoPDF encoding and does the translations), but of
course that doesn't mean I did it right. For example, I find it odd that I
didn't have to implement BMC.

I think that you do have to be concerned about the hierarchy. At least from
an end-user perspective in Acrobat if you hide a parent layer all the
children layers become hidden. So I think hiding the parent layer in the
library should cause the same result. Often in a PDF the parent layer is
only there for grouping purposes and itself has no content.

What I couldn't say for certain is whether you will ever get a PDF that is
encoded outside of my expectations. For example, given sibling layers A and
B, will you ever get a PDF where content for B appears inside content for
A? In the reference PDFs I have, the answer is no, but it seems physically
possible. Being unfamiliar with the actual PDF spec, I don't know whether
this is a contractual thing or what.

I wrote the code so that you mark the hidden entries, mostly because that
was my mentality. It offers a slight economy because the code that
determines if the current stack is hidden searches a smaller list (the
current stack vs. the list of all the layers). If you flipped it around,
you would render only if every entry in the current stack was present in
the keep list.
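
The two policies can be put side by side in a short sketch. This is an illustration only, with hypothetical names (LayerVisibilityPolicy, rendersWithHideList, rendersWithKeepList), and it assumes the stack holds layer IDs with null for sections that carry no layer information, as in the code posted earlier in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Compares a hide-list and a keep-list decision over the same stack. */
final class LayerVisibilityPolicy {

    /** Hide-list: suppress as soon as any open layer is marked hidden. */
    static boolean rendersWithHideList(List<String> stack, Set<String> hidden) {
        return stack.stream().filter(Objects::nonNull).noneMatch(hidden::contains);
    }

    /** Keep-list: render only if every open layer is explicitly kept. */
    static boolean rendersWithKeepList(List<String> stack, Set<String> keep) {
        return stack.stream().filter(Objects::nonNull).allMatch(keep::contains);
    }

    public static void main(String[] args) {
        // MC3 is nested inside MC0, with an unrelated section in between.
        List<String> stack = Arrays.asList("MC0", null, "MC3");

        System.out.println(rendersWithHideList(stack, Set.of("MC3")));        // false
        System.out.println(rendersWithKeepList(stack, Set.of("MC0")));        // false
        System.out.println(rendersWithKeepList(stack, Set.of("MC0", "MC3"))); // true
    }
}
```

One consequence worth noting: with a keep-list, a child layer only shows up if its parents are listed too, whereas with a hide-list a single entry for the parent is enough to suppress the whole subtree.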

What I really wanted was to key off the true/false status in the optional
content groups (the API), but I didn't have the understanding of the library
to figure out how to get into that at the right time. I also would have
really liked to have the index name (MC1, etc.) be present in the OCG class.
Funny enough, that is what cost me the most time: figuring out how to get a
map between them. It's all there of course, it just was not obvious to me.

To be honest though, all this is a trivial change to the library. Other than
BDC & EMC, which are nearly empty classes, the only real thing that matters
was processStreamOperators(), which is only a handful of lines different. I
would think you could go in and make these changes in a few minutes and
probably be quite satisfied with the results. Most of the code in my
example is really just dealing with the fact I extended your classes rather
than fork the project. To do a full implementation though I would suggest
extending the OCG pojos to include the index name and use the true/false
value there to drive the code.

That's what I did when I first got the library - I set some OCG layers to
false in the POJO and then became surprised when they still rendered. So I
think that's the natural way to use the library vs. building some other
list of things you want to keep or get rid of.

-cal


On Thu, Jun 16, 2016 at 9:09 AM, Tilman Hausherr <TH...@t-online.de>
wrote:

> I'd be interested in this, but not as an extension, rather as a change in
> the trunk. Our two main java competitors both can do it.
>
> I'd prefer that the PDFRenderer and PageDrawer remain as they are, but
> that the PDFRenderer gets a new setter for the map, e.g. setOCG() or
> setOCGMap(). This map would then be passed in PageDrawerParameters. Btw
> shouldn't we put into the map what we want to keep? Is there a good
> argument to do it one way or the other? Do we have to care about the
> hierarchy?
>
> The IDs are in the PDF... open a file with the PDFDebugger-App and you'll
> see :-)
>
> There is a file with OCGs at PDFBox
> reactor\pdfbox\target\test-output\ocg-generation.pdf
>
> Other files with layers:
> - gs-bugzilla689931.pdf
> - gs-bugzilla693548-STACKOVERFLOW.pdf
> - gs-bugzilla692933.pdf
> - many maps at http://ims.er.usgs.gov/images/USTopo/ (the huge files)
>
> So if you're willing to change your code accordingly, please open an issue
> in JIRA, and attach your code as a patch / diff.
>
> https://issues.apache.org/jira/browse/PDFBOX
>
> Tilman
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: users-help@pdfbox.apache.org
>
>


-- 
C. Alexander Leigh

Re: Suppressing layers on output

Posted by John Hewson <jo...@jahewson.com>.
> On 19 Jun 2016, at 08:04, Andreas Lehmkuehler <an...@lehmi.de> wrote:
> 
> Am 19.06.2016 um 16:11 schrieb Tilman Hausherr:
>> Am 19.06.2016 um 08:52 schrieb John Hewson:
>>>>> >>JIRA, and attach your code as a patch / diff.
>>>> >There is already some code handling those operators, see
>>>> PDFMarkedContentExtractor. It could be moved to a more generic place so that
>>>> we have to add some filtering only.
>>> Yes, that is the proper way to handle this. Operators are handled with an
>>> OperatorProcessor, not by modifying the parser (e.g. processStreamOperators).
>>> Better yet, we already have the code to handle BMC/EMC. All that is needed is
>>> for PDFRenderer to add a constructor which accepts a list of layer names to
>>> render, which are then passed as part of PageDrawerParameters.
>> 
>> The problem is that these two operators influence whether or not all the other
>> tokens in the content stream are used or not. So the method by C. makes sense to
>> me.  The alternative would be to alter every operator processor to check whether
>> it is relevant or not.
>> Or they would have to be extended from some common class that does this check.

The alternative is actually really simple. The parser should not be responsible for high-level
processing such as this. It’s the job of an OperatorProcessor to handle how operators are
processed, and of PDFStreamEngine to handle the actual work - that’s the core of our
subclassing & extensibility model for PDFBox.

So take the view that BMC and EMC don’t affect the tokens, they affect rendering. We should still process the
tokens as normal and have BMC and EMC set a flag on PageDrawer (or one of its superclasses)
which indicates which layer is currently being processed. The PageDrawer can then decide what to do
with this information - namely check in strokePath, fillPath, fillAndStrokePath, and drawImage whether
or not to suppress rendering. No need to extend any OperatorProcessors.

I’ve explained how this would be done for PageDrawer, but it might be better to do all of this in
PDFStreamEngine rather than PageDrawer, as then other subclasses can benefit from this functionality.
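
A rough sketch of this flag-based design, for illustration only. It is a toy class with hypothetical names (FlagBasedDrawer, beginMarkedContent, endMarkedContent, setColor, painted), not the PDFBox API, and it models just one drawing call where a real PageDrawer would check in strokePath, fillPath, fillAndStrokePath and drawImage. Every operator is processed as normal, and only the drawing call consults the layer stack.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Toy engine: all operators run, only the drawing calls check visibility. */
final class FlagBasedDrawer {
    private final List<String> layerStack = new ArrayList<>();
    private final List<String> painted = new ArrayList<>();
    private final Set<String> hidden;
    private String currentColor = "black";

    FlagBasedDrawer(Set<String> hidden) {
        this.hidden = hidden;
    }

    /** Called for a begin-marked-content operator; layerId may be null. */
    void beginMarkedContent(String layerId) {
        layerStack.add(layerId);
    }

    /** Called for EMC; pops the innermost section if there is one. */
    void endMarkedContent() {
        if (!layerStack.isEmpty()) {
            layerStack.remove(layerStack.size() - 1);
        }
    }

    /** State changes are always applied, even inside a hidden layer. */
    void setColor(String color) {
        currentColor = color;
    }

    /** Drawing is the only place where the visibility check happens. */
    void fillPath(String shape) {
        if (isVisible()) {
            painted.add(currentColor + " " + shape);
        }
    }

    private boolean isVisible() {
        for (String id : layerStack) {
            if (id != null && hidden.contains(id)) {
                return false;
            }
        }
        return true;
    }

    List<String> painted() {
        return painted;
    }

    public static void main(String[] args) {
        FlagBasedDrawer drawer = new FlagBasedDrawer(Set.of("MC2"));
        drawer.beginMarkedContent("MC2");
        drawer.setColor("red");
        drawer.fillPath("ortho");   // suppressed: MC2 is hidden
        drawer.endMarkedContent();
        drawer.fillPath("frame");   // drawn with the colour set inside MC2
        System.out.println(drawer.painted());
    }
}
```

This prints [red frame]. In this toy model the visible difference from skipping tokens is that state set inside a hidden section (here the colour) still carries over to later content, whereas a token-skipping loop would never have seen the colour change.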

>> PDFMarkedContentExtractor is not really helpful. Here's some code to show what
>> it does - it shows the objects that belong to a specific group. The output
>> cannot be used for rendering.
> Maybe there is a misunderstanding. We need to track the current layer and the stack of all current layers. C. provided some code doing that and we already have some code doing it (I'm talking about the operators in org.apache.pdfbox.contentstream.operator.markedcontent). What is missing is some sort of filter based on that information.

Exactly, PDFMarkedContentExtractor already contains implementations of the necessary OperatorProcessors. We just need to move them into separate files, and as you say, add some sort of filter in PDFStreamEngine / PageDrawer.


Re: Suppressing layers on output

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 19.06.2016 um 16:11 schrieb Tilman Hausherr:
> Am 19.06.2016 um 08:52 schrieb John Hewson:
>>>> >>JIRA, and attach your code as a patch / diff.
>>> >There is already some code handling those operators, see
>>> PDFMarkedContentExtractor. It could be moved to a more generic place so that
>>> we have to add some filtering only.
>> Yes, that is the proper way to handle this. Operators are handled with an
>> OperatorProcessor, not by modifying the parser (e.g. processStreamOperators).
>> Better yet, we already have the code to handle BMC/EMC. All that is needed is
>> for PDFRenderer to add a constructor which accepts a list of layer names to
>> render, which are then passed as part of PageDrawerParameters.
>
> The problem is that these two operators influence whether or not all the other
> tokens in the content stream are used or not. So the method by C. makes sense to
> me.  The alternative would be to alter every operator processor to check whether
> it is relevant or not.
> Or they would have to be extended from some common class that does this check.
>
> PDFMarkedContentExtractor is not really helpful. Here's some code to show what
> it does - it shows the objects that belong to a specific group. The output
> cannot be used for rendering.
Maybe there is a misunderstanding. We need to track the current layer and the 
stack of all current layers. C. provided some code doing that and we already 
have some code doing it (I'm talking about the operators in 
org.apache.pdfbox.contentstream.operator.markedcontent). What is missing is some 
sort of filter based on that information.

BR
Andreas


Re: Suppressing layers on output

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 19.06.2016 um 08:52 schrieb John Hewson:
>>> >>JIRA, and attach your code as a patch / diff.
>> >There is already some code handling those operators, see PDFMarkedContentExtractor. It could be moved to a more generic place so that we have to add some filtering only.
> Yes, that is the proper way to handle this. Operators are handled with an OperatorProcessor, not by modifying the parser (e.g. processStreamOperators). Better yet, we already have the code to handle BMC/EMC. All that is needed is for PDFRenderer to add a constructor which accepts a list of layer names to render, which are then passed as part of PageDrawerParameters.

The problem is that these two operators influence whether or not all the 
other tokens in the content stream are used or not. So the method by C. 
makes sense to me.  The alternative would be to alter every operator 
processor to check whether it is relevant or not.
Or they would have to be extended from some common class that does this 
check.

PDFMarkedContentExtractor is not really helpful. Here's some code to 
show what it does - it shows the objects that belong to a specific 
group. The output cannot be used for rendering.


import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import 
org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDMarkedContent;
import 
org.apache.pdfbox.pdmodel.documentinterchange.markedcontent.PDPropertyList;
import 
org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentGroup;
import 
org.apache.pdfbox.pdmodel.graphics.optionalcontent.PDOptionalContentProperties;
import org.apache.pdfbox.text.PDFMarkedContentExtractor;

public class ExtractMarkedContent extends PDFMarkedContentExtractor
{

     public ExtractMarkedContent() throws IOException
     {
     }

     public static void main(String[] args) throws IOException
     {

        PDDocument doc = PDDocument.load(new File("C......\\PDFBox reactor\\pdfbox\\target\\test-output","ocg-generation.pdf"));
         PDOptionalContentProperties ocp = 
doc.getDocumentCatalog().getOCProperties();
         System.out.println("Group names in document catalog: " + 
Arrays.toString(ocp.getGroupNames()));
         for (String groupName : ocp.getGroupNames())
         {
             PDOptionalContentGroup group = ocp.getGroup(groupName);
             System.out.println(group.getCOSObject());
         }
         ExtractMarkedContent extractMarkedContent = new 
ExtractMarkedContent();
         PDPage page = doc.getPage(0);
         System.out.println("Property names in page resources: " + 
page.getResources().getPropertiesNames());
         extractMarkedContent.processPage(page);
         List<PDMarkedContent> markedContents = 
extractMarkedContent.getMarkedContents();
         System.out.println("Extracted contents: ");
         for (PDMarkedContent mc : markedContents)
         {
             PDPropertyList propertyList = 
page.getResources().getProperties(COSName.getPDFName(mc.getTag()));
             String propName = 
propertyList.getCOSObject().getString(COSName.NAME);
             System.out.println(mc.getTag() + " (" + propName + "): " + 
mc.getContents());
         }
         doc.close();
     }
}


The output is:

Group names in document catalog: [background, enabled, disabled]
COSDictionary{(COSName{Type}:COSName{OCG}) 
(COSName{Name}:COSString{background}) }
COSDictionary{(COSName{Type}:COSName{OCG}) 
(COSName{Name}:COSString{enabled}) }
COSDictionary{(COSName{Type}:COSName{OCG}) 
(COSName{Name}:COSString{disabled}) }
Property names in page resources: [COSName{oc1}, COSName{oc2}, COSName{oc3}]
Extracted contents:
oc1 (background): [P, D, F,  , 1, ., 5, :,  , O, p, t, i, o, n, a, l,  , 
C, o, n, t, e, n, t,  , G, r, o, u, p, s, Y, o, u,  , s, h, o, u, l, d,  
, s, e, e,  , a,  , g, r, e, e, n,  , t, e, x, t, l, i, n, e, ,,  , b, 
u, t,  , n, o,  , r, e, d,  , t, e, x, t,  , l, i, n, e, .]
oc2 (enabled): [T, h, i, s,  , i, s,  , f, r, o, m,  , a, n,  , e, n, a, 
b, l, e, d,  , l, a, y, e, r, .,  , I, f,  , y, o, u,  , s, e, e,  , t, 
h, i, s, ,,  , t, h, a, t, ', s,  , g, o, o, d, .]
oc3 (disabled): [T, h, i, s,  , i, s,  , f, r, o, m,  , a,  , d, i, s, 
a, b, l, e, d,  , l, a, y, e, r, .,  , I, f,  , y, o, u,  , s, e, e,  , 
t, h, i, s, ,,  , t, h, a, t, ', s,  , N, O, T,  , g, o, o, d, !]
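
The extracted contents above print one element per character. If a readable
line per layer is wanted, the elements can be concatenated. The following is
only a minimal sketch with no PDFBox dependency: the class and method names
are made up for illustration, and it assumes that each element's toString()
yields the text it stands for, as the output above suggests. Nested marked
content or non-text elements would need separate handling.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
public class MarkedContentText {

    /**
     * Concatenates the string form of every element.
     * Assumption: toString() of each element is the text it represents.
     */
    static String join(List<?> contents) {
        StringBuilder sb = new StringBuilder();
        for (Object o : contents) {
            sb.append(o);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the list returned by mc.getContents()
        List<Object> contents = Arrays.<Object>asList("T", "h", "i", "s", " ", "i", "s");
        System.out.println(join(contents));
    }
}
```

In the loop above this would be used as join(mc.getContents()) in place of
printing the list directly.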




Re: Suppressing layers on output

Posted by John Hewson <jo...@jahewson.com>.
> On 16 Jun 2016, at 12:56, Andreas Lehmkuehler <an...@lehmi.de> wrote:
> 
> On 16.06.2016 at 18:09, Tilman Hausherr wrote:
>> I'd be interested in this, but not as an extension, rather as a change in the
>> trunk. Our two main java competitors both can do it.
>> 
>> I'd prefer that the PDFRenderer and PageDrawer remain as they are, but that the
>> PDFRenderer gets a new setter for the map, e.g. setOCG() or setOCGMap(). This
>> map would then be passed in PageDrawerParameters. Btw shouldn't we put into the
>> map what we want to keep? Is there a good argument to do it one way or the
>> other? Do we have to care about the hierarchy?
>> 
>> The IDs are in the PDF... open a file with the PDFDebugger-App and you'll see :-)
>> 
>> There is a file with OCGs at PDFBox
>> reactor\pdfbox\target\test-output\ocg-generation.pdf
>> 
>> Other files with layers:
>> - gs-bugzilla689931.pdf
>> - gs-bugzilla693548-STACKOVERFLOW.pdf
>> - gs-bugzilla692933.pdf
>> - many maps at http://ims.er.usgs.gov/images/USTopo/ (the huge files)
>> 
>> So if you're willing to change your code accordingly, please open an issue in
>> JIRA, and attach your code as a patch / diff.
> There is already some code handling those operators, see PDFMarkedContentExtractor. It could be moved to a more generic place so that we have to add some filtering only.

Yes, that is the proper way to handle this. Operators are handled with an OperatorProcessor, not by modifying the parser (e.g. processStreamOperators). Better yet, we already have the code to handle BMC/EMC. All that is needed is for PDFRenderer to add a constructor which accepts a list of layer names to render, which are then passed as part of PageDrawerParameters.
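
To illustrate the bookkeeping such operator processors would need, here is a
rough sketch with no PDFBox dependency. The class LayerVisibilityTracker and
its methods are hypothetical, not existing PDFBox API. It assumes the caller
has already resolved the layer name of each BDC (or passes null for a BMC/BDC
that carries no layer information), and it keeps a stack so that nested
sequences and their matching EMCs are accounted for.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of PDFBox.
public class LayerVisibilityTracker {
    private final Set<String> layersToRender;
    // One entry per open BMC/BDC: true if that sequence names a hidden layer.
    private final Deque<Boolean> hiddenStack = new ArrayDeque<>();
    // Number of currently open sequences that name a hidden layer.
    private int hiddenDepth = 0;

    public LayerVisibilityTracker(Set<String> layersToRender) {
        this.layersToRender = new HashSet<>(layersToRender);
    }

    /** Call on BMC/BDC; layerName is null if the sequence has no layer info. */
    public void begin(String layerName) {
        boolean hidden = layerName != null && !layersToRender.contains(layerName);
        hiddenStack.push(hidden);
        if (hidden) {
            hiddenDepth++;
        }
    }

    /** Call on EMC; an unmatched EMC is ignored. */
    public void end() {
        if (hiddenStack.isEmpty()) {
            return;
        }
        if (hiddenStack.pop()) {
            hiddenDepth--;
        }
    }

    /** Content is drawn only while no enclosing sequence is hidden. */
    public boolean isVisible() {
        return hiddenDepth == 0;
    }

    public static void main(String[] args) {
        LayerVisibilityTracker t = new LayerVisibilityTracker(
                new HashSet<>(Arrays.asList("background", "enabled")));
        t.begin("disabled");               // layer that is not in the render list
        t.begin(null);                     // nested sequence without layer info
        System.out.println(t.isVisible()); // false
        t.end();
        t.end();
        System.out.println(t.isVisible()); // true
    }
}
```

A drawer would consult isVisible() before painting. How the layer name is
looked up from the page resources is left out here.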

— John

> BR
> Andreas
>> https://issues.apache.org/jira/browse/PDFBOX
>> 
>> Tilman
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
>> For additional commands, e-mail: users-help@pdfbox.apache.org
>> 
> 
> 


Re: Suppressing layers on output

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 16.06.2016 at 18:09, Tilman Hausherr wrote:
> I'd be interested in this, but not as an extension, rather as a change in the
> trunk. Our two main java competitors both can do it.
>
> I'd prefer that the PDFRenderer and PageDrawer remain as they are, but that the
> PDFRenderer gets a new setter for the map, e.g. setOCG() or setOCGMap(). This
> map would then be passed in PageDrawerParameters. Btw shouldn't we put into the
> map what we want to keep? Is there a good argument to do it one way or the
> other? Do we have to care about the hierarchy?
>
> The IDs are in the PDF... open a file with the PDFDebugger-App and you'll see :-)
>
> There is a file with OCGs at PDFBox
> reactor\pdfbox\target\test-output\ocg-generation.pdf
>
> Other files with layers:
> - gs-bugzilla689931.pdf
> - gs-bugzilla693548-STACKOVERFLOW.pdf
> - gs-bugzilla692933.pdf
> - many maps at http://ims.er.usgs.gov/images/USTopo/ (the huge files)
>
> So if you're willing to change your code accordingly, please open an issue in
> JIRA, and attach your code as a patch / diff.
There is already some code handling those operators, see 
PDFMarkedContentExtractor. It could be moved to a more generic place so that we 
have to add some filtering only.

BR
Andreas
> https://issues.apache.org/jira/browse/PDFBOX
>
> Tilman
>


Re: Suppressing layers on output

Posted by Tilman Hausherr <TH...@t-online.de>.
I'd be interested in this, but not as an extension, rather as a change 
in the trunk. Our two main java competitors both can do it.

I'd prefer that the PDFRenderer and PageDrawer remain as they are, but 
that the PDFRenderer gets a new setter for the map, e.g. setOCG() or 
setOCGMap(). This map would then be passed in PageDrawerParameters. Btw 
shouldn't we put into the map what we want to keep? Is there a good 
argument to do it one way or the other? Do we have to care about the 
hierarchy?
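
As a small illustration of the keep-versus-suppress question, the sketch
below shows the one place where the two conventions differ: a layer that is
not listed at all. Everything in it (LayerSelection, Mode, shouldRender) is
a made-up name for illustration, not PDFBox API, and it ignores any layer
hierarchy.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not part of PDFBox.
public class LayerSelection {
    enum Mode { KEEP_LISTED, SUPPRESS_LISTED }

    private final Mode mode;
    private final Set<String> names;

    LayerSelection(Mode mode, Set<String> names) {
        this.mode = mode;
        this.names = new HashSet<>(names);
    }

    /** Decides whether a layer is rendered under the chosen convention. */
    boolean shouldRender(String layerName) {
        boolean listed = names.contains(layerName);
        return mode == Mode.KEEP_LISTED ? listed : !listed;
    }

    public static void main(String[] args) {
        LayerSelection keep = new LayerSelection(Mode.KEEP_LISTED,
                new HashSet<>(Arrays.asList("background", "enabled")));
        LayerSelection suppress = new LayerSelection(Mode.SUPPRESS_LISTED,
                new HashSet<>(Arrays.asList("disabled")));

        // Both conventions agree on the layers that were anticipated ...
        System.out.println(keep.shouldRender("disabled"));     // false
        System.out.println(suppress.shouldRender("disabled")); // false

        // ... but differ for a layer nobody listed ("contours" is a made-up name).
        System.out.println(keep.shouldRender("contours"));     // false
        System.out.println(suppress.shouldRender("contours")); // true
    }
}
```

So with a keep list, unknown layers disappear by default, while with a
suppress list they are drawn by default.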

The IDs are in the PDF... open a file with the PDFDebugger-App and 
you'll see :-)

There is a file with OCGs at PDFBox 
reactor\pdfbox\target\test-output\ocg-generation.pdf

Other files with layers:
- gs-bugzilla689931.pdf
- gs-bugzilla693548-STACKOVERFLOW.pdf
- gs-bugzilla692933.pdf
- many maps at http://ims.er.usgs.gov/images/USTopo/ (the huge files)

So if you're willing to change your code accordingly, please open an 
issue in JIRA, and attach your code as a patch / diff.

https://issues.apache.org/jira/browse/PDFBOX

Tilman
