Posted to users@pdfbox.apache.org by Eugeny N Dzhurinsky <bo...@redwerk.com> on 2009/01/18 15:30:55 UTC

Hidden links in PDF document?

Hello!

We are using PDFBox 0.7.3 to extract meta-information, including web links,
from PDF documents.  Recently we found that for some PDF documents PDFBox
finds links that are not displayed in Acrobat Reader or XPDF.

For example: http://www.pmi.org/PDF/PMI%20Professional%20Awards%20History.pdf
does not contain any visible links, but PDFBox is able to find some.

I've created a simple unit test to illustrate the issue:

==================================================================================
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;

import junit.framework.TestCase;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.pdmodel.PDPage;
import org.pdfbox.pdmodel.interactive.action.type.PDAction;
import org.pdfbox.pdmodel.interactive.action.type.PDActionURI;
import org.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;


public class PDFBoxLinkExtractionTest extends TestCase {

    public void testHiddenLinksPresence() throws Exception {
        PDFParser parser = new PDFParser(PDFBoxLinkExtractionTest.class
                .getResourceAsStream("/PMI Professional Awards History.pdf"));
        parser.parse();
        PDDocument doc = parser.getPDDocument();
        List pages = doc.getDocumentCatalog().getAllPages();
        if (pages == null || pages.isEmpty())
            return;
        Set foundLinks = new HashSet();
        for (final Iterator it = pages.iterator(); it.hasNext(); ) {
            final PDPage page = (PDPage) it.next();
            final List annotations = page.getAnnotations();
            for (final Iterator jt = annotations.iterator(); jt.hasNext();) {
                final PDAnnotation annot = (PDAnnotation) jt.next();
                if (annot instanceof PDAnnotationLink) {
                    final PDAnnotationLink link = (PDAnnotationLink) annot;
                    final PDAction action = link.getAction();
                    if (action instanceof PDActionURI) {
                        final PDActionURI uri = (PDActionURI) action;
                        final String strURI = uri.getURI();
                        // A Set ignores duplicates, so no contains() check is needed.
                        foundLinks.add(strURI);
                    }
                }
            }
        }
        // Release the parsed document before asserting on the collected links.
        doc.close();
        assertTrue("Expected no links, but found " + foundLinks, foundLinks
                .isEmpty());
    }

}
==================================================================================

Could somebody please explain how to discard these invisible links when
parsing a PDF as shown in the unit test above?
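
One filter that might be worth trying is sketched below. It is only a guess at
the cause: it assumes the unexpected links are annotations that either carry
the Hidden or NoView flag (bits 2 and 6 of the annotation's /F entry in the
PDF specification) or have a /Rect with zero width or height. This has not
been verified against the document above. LinkVisibility and isLikelyVisible
are hypothetical names and are not part of PDFBox.

```java
/** Hypothetical helper, not part of PDFBox. */
final class LinkVisibility {

    // Annotation flag bits of the /F entry, as defined in the PDF specification.
    static final int FLAG_HIDDEN = 1 << 1;   // bit 2: never displayed or printed
    static final int FLAG_NO_VIEW = 1 << 5;  // bit 6: not displayed on screen

    private LinkVisibility() {
    }

    /**
     * Returns false when the annotation is flagged as not shown on screen
     * or when its rectangle has no area, true otherwise.
     *
     * @param flags integer value of the annotation's /F entry (0 if absent)
     * @param llx   lower-left x of /Rect
     * @param lly   lower-left y of /Rect
     * @param urx   upper-right x of /Rect
     * @param ury   upper-right y of /Rect
     */
    static boolean isLikelyVisible(int flags, float llx, float lly,
                                   float urx, float ury) {
        if ((flags & (FLAG_HIDDEN | FLAG_NO_VIEW)) != 0) {
            return false;
        }
        // Corner order is not relied upon, so take absolute differences.
        float width = Math.abs(urx - llx);
        float height = Math.abs(ury - lly);
        return width > 0f && height > 0f;
    }
}
```

In the test above, the check would sit next to the instanceof PDAnnotationLink
test, fed with the annotation's /F and /Rect values. How best to read those
values with PDFBox 0.7.3 is not verified here. A link can also be invisible
for other reasons (for example, a rectangle outside the page area), which this
sketch does not cover.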

Thank you in advance!

-- 
Eugene N Dzhurinsky