Posted to dev@lucene.apache.org by "Hoss Man (JIRA)" <ji...@apache.org> on 2017/10/31 00:38:00 UTC

[jira] [Updated] (SOLR-10934) create a link+anchor checker for the ref-guide PDF using PDFBox

     [ https://issues.apache.org/jira/browse/SOLR-10934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man updated SOLR-10934:
----------------------------
    Attachment: SOLR-10934.patch



Ok, I'm attaching a really rough and dirty patch that includes:

* A quick and dirty CheckPDFLinksAndAnchors inspired by the SO post mentioned and the original PrintURLs.java demo from pdfbox
* a build.xml 'nocommit' target to run it against our PDF
* some "broken" changes to our ref-guide content to deliberately introduce a few errors...
*# anchor duplicated in multiple source pages
*# links to each of the different dup anchors
*# link to an anchor that doesn't exist in the specified source doc, but does exist in a different doc
*# links to a source doc that doesn't exist
*# links to an anchor that doesn't exist (in a source doc that does)

The results aren't promising...

# FAIL: the dup anchors cause asciidoctor to print a WARNING (even w/o any link checking) that i'd forgotten about, but as far as i can tell from my exploration of the {{PDDocumentCatalog}}, that duplicated information is lost in the underlying PDF (or if it does make it into the PDF, PDFBox loses it when parsing the PDF, because the "Catalog" is just a Map)
# FAIL: the PDF Annotations for each of the dup links both wind up mapping to the page with the first occurrence -- again: either because the catalog in the file can only track one location for a given anchor, or because that's just how PDFBox deals with the precedence of dup dict keys when reading the file
# FAIL: if an anchor doesn't exist in the specified source {{\*.adoc}} file, but does exist somewhere else in the final PDF, then that's where asciidoctor points the generated link -- there's nothing weird about it that i can detect from PDFBox
# GOOD: links to a source {{\*.adoc}} file that doesn't actually exist on disk are fairly easy to detect -- asciidoctor's default behavior is to assume that these are links to other docs that will be converted separately, so they show up as "relative URIs" which we can treat as a failure (ie: if a link in a PDF is to a non-absolute URI, it must be a content error)
# GOOD: links to an anchor that doesn't exist are likewise easy to identify: the "annotation" is preserved but has no destination, which we can treat as a failure.
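
The two "GOOD" cases boil down to a small decision rule. The following is only a minimal sketch of that rule, not the code in the patch: the class and method names are hypothetical, and it assumes the caller has already pulled the URI string (or null, for an internal link) and a "has a page destination" flag out of each link annotation via PDFBox. It uses only {{java.net.URI}} so it can run on its own.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of SOLR-10934.patch.
class LinkTargetCheck {

    /**
     * Returns a failure description for a link, or null if nothing looks wrong.
     *
     * @param uri         URI of an external link, or null for an internal link
     * @param hasPageDest whether an internal link resolved to a page destination
     */
    static String check(String uri, boolean hasPageDest) {
        if (uri != null) {
            try {
                // A URI with no scheme is "relative": asciidoctor assumed the
                // target doc would be converted separately, so treat it as an error.
                if (!new URI(uri).isAbsolute()) {
                    return "BOGUS URI: " + uri;
                }
            } catch (URISyntaxException e) {
                // Assumption: an unparseable URI is also reported as bogus.
                return "BOGUS URI: " + uri;
            }
            return null;
        }
        // Internal link: the annotation exists, but it may point nowhere.
        return hasPageDest ? null : "link with no page dest";
    }

    public static void main(String[] args) {
        System.out.println(check("nocommit_bogus_page.pdf#nocommit_bogus_x2", false));
        System.out.println(check("https://issues.apache.org/jira/browse/SOLR-10934", false));
        System.out.println(check(null, false));
    }
}
```

Note that this rule can only catch the two detectable cases; it says nothing about the three "FAIL" cases, where the link in the PDF looks perfectly valid.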

The important bits of the output w/this patch are included below...

{noformat}
-build-raw-pdf:
[asciidoctor:convert] Render SolrRefGuide-all.adoc from /home/hossman/lucene/dev/solr/build/solr-ref-guide/content/pdf to /home/hossman/lucene/dev/solr/build/solr-ref-guide/pdf-tmp with backend=pdf
[asciidoctor:convert] asciidoctor: ERROR: about-this-guide.adoc: line 1: invalid part, must have at least one section (e.g., chapter, appendix, etc.)
[asciidoctor:convert] asciidoctor: ERROR: solr-glossary.adoc: line 1: invalid part, must have at least one section (e.g., chapter, appendix, etc.)
[asciidoctor:convert] asciidoctor: WARNING: errata.adoc: line 30: id assigned to section already in use: nocommit_dup_anchor_name
[asciidoctor:convert] asciidoctor: ERROR: SolrRefGuide-all.adoc: line 37: invalid part, must have at least one section (e.g., chapter, appendix, etc.)
     [move] Moving 1 file to /home/hossman/lucene/dev/solr/build/solr-ref-guide/pdf-tmp
...
nocommit:
     [java] Page 753:'Link to bogus page @ anchor that does not exist'=> BOGUS URI: nocommit_bogus_page.pdf#nocommit_bogus_x2
     [java] Page 753:'Link to about @ anchor that does not exist' => link with no page dest

{noformat}

----

All in all these results are disappointing.

The "Single Page" output behavior of asciidoctor, combined with the "bugs" in asciidoctor's handling of duplicated anchors in page includes, combined with the underlying structure of the PDF, makes it really hard to find the same types of failures we can find when parsing the jekyll-generated pages using our white-box knowledge of "there must be no dup anchors across all pages".
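
For comparison, the "no dup anchors across all pages" rule is easy to enforce when the anchors are still grouped per page, which is exactly the information the PDF no longer exposes. This is a rough sketch under that assumption, not the actual CheckLinksAndAnchors.java logic; the class name, method name, and page names in the example data are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the real CheckLinksAndAnchors.java.
class DupAnchorCheck {

    /** Maps each anchor id that appears on more than one page to those pages. */
    static Map<String, List<String>> findDuplicates(Map<String, List<String>> anchorsByPage) {
        Map<String, List<String>> pagesByAnchor = new TreeMap<>();
        for (Map.Entry<String, List<String>> page : anchorsByPage.entrySet()) {
            for (String anchor : page.getValue()) {
                pagesByAnchor.computeIfAbsent(anchor, k -> new ArrayList<>()).add(page.getKey());
            }
        }
        // Keep only the anchors seen on two or more pages.
        pagesByAnchor.values().removeIf(pages -> pages.size() < 2);
        return pagesByAnchor;
    }

    public static void main(String[] args) {
        // Made-up input: page name -> anchor ids found on that page.
        Map<String, List<String>> anchorsByPage = new LinkedHashMap<>();
        anchorsByPage.put("errata.html", Arrays.asList("nocommit_dup_anchor_name", "errata_intro"));
        anchorsByPage.put("other-page.html", Arrays.asList("nocommit_dup_anchor_name"));
        System.out.println(findDuplicates(anchorsByPage));
    }
}
```

Once every page has been merged into a single PDF, there is no equivalent per-page grouping to feed into a check like this, which is why the first FAIL case above can't be caught.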


> create a link+anchor checker for the ref-guide PDF using PDFBox
> ---------------------------------------------------------------
>
>                 Key: SOLR-10934
>                 URL: https://issues.apache.org/jira/browse/SOLR-10934
>             Project: Solr
>          Issue Type: Sub-task
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: documentation
>            Reporter: Hoss Man
>         Attachments: SOLR-10934.patch
>
>
> We currently have CheckLinksAndAnchors.java which is automatically run against the ref-guide HTML as part of the build to use JSoup to find bad links/anchors that asciidoctor doesn't complain about -- but not everyone does/can build the HTML version of the ref-guide since it requires manually installing jekyll.
> The PDF build only requires things installed by ivy (via JRuby) and we already have some PDFBox based code in ReducePDFSize.java that operates on this PDF every time it's run -- so if we can find a way to do similar checks using the PDFBox API we could catch these broken links faster.


