Posted to dev@whimsical.apache.org by Shane Curcuru <as...@shanecurcuru.org> on 2018/04/11 13:46:06 UTC

[PROPOSAL] improvements to site-scan.rb/site.cgi

I'd like to simplify some of the site-scan.rb/site.cgi processing by
centralizing some of the core things that the scripts are searching for
into site-scan.rb.  While I appreciate the original design motivation,
we currently have duplicate regexes - and we have more people interested
in using the results of the site scan (esp. with events) and officers
potentially requesting changes to the requirements.

Roughly, I'd like to move most of CHECKS into site-scan.rb for
simplicity and use those to implement most of the link scans.  Some of
the scans still have more logic (which would still be custom), but some
of them can be mechanical.

CHECKS = {
  'events'      =>
    [
      '',
      # a_text regex to scan for - for events, we don't care, so blank
      '/apache.org/events',
      # a_href minimal regex to capture - for events, this tells us what
      # link to capture from the page
      %r{^https?://.*apache.org/events/current-event}
      # a_href full regex to expect for compliance (used in site.cgi)
    ],

  'license'      =>
    [
      '/licenses?/',
      # a_text regex to scan for - for license, this is required
      'apache.org',
      # a_href minimal regex to capture - for license, we only capture
      # the link if it points to apache.org
      %r{^https?://.*apache.org/licenses/$}
      # a_href full regex to expect for compliance; it must point to one
      # of our actual licenses to pass
    ],
...etc.
}
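
A minimal sketch of how a "mechanical" scan could be driven from such a table. It assumes each entry is normalized to three real Regexp objects (a_text, a_href capture, a_href validate) and that links arrive as [a_text, a_href] pairs. The method name scan_links and the sample links are hypothetical, not the actual site-scan.rb code.

```ruby
# Hypothetical, normalized form of the CHECKS table: every entry holds
# [a_text regex, a_href capture regex, a_href validation regex].
CHECKS = {
  # Empty a_text regex matches any link text (for events, we don't care).
  'events'  => [Regexp.new(''), %r{apache.org/events},
                %r{^https?://.*apache.org/events/current-event}],
  'license' => [/licenses?/i, %r{apache.org},
                %r{^https?://.*apache.org/licenses/$}],
}.freeze

# Scan [a_text, a_href] pairs; capture the first matching href per check.
def scan_links(links, checks = CHECKS)
  checks.each_with_object({}) do |(name, (text_re, capture_re, _validate_re)), found|
    match = links.find { |text, href| text =~ text_re && href =~ capture_re }
    found[name] = match[1] if match
  end
end

# Sample links, invented for illustration.
links = [
  ['Apache License', 'https://www.apache.org/licenses/'],
  ['Current event', 'https://www.apache.org/events/current-event'],
]
captured = scan_links(links)

# The stricter validation regex is only applied when reporting.
captured.each do |name, href|
  status = href =~ CHECKS[name][2] ? 'ok' : 'needs review'
  puts "#{name}: #{href} (#{status})"
end
```

Scans that need more logic would stay custom; only the table-driven ones would go through a helper like this.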

Any overall objections?  It's making me twitchy seeing most of the
regexes we use for scanning in separate places.

--

- Shane
  Director & Member
  The Apache Software Foundation

Re: [PROPOSAL] improvements to site-scan.rb/site.cgi

Posted by Shane Curcuru <as...@shanecurcuru.org>.
Also - any objection to taking most of the code out of www/pods.cgi
and replacing it with shared methods from site.cgi somehow?  Obviously
pods.cgi needs to display some different text, and may have additional
checks, but really most of the code should be identical to site.cgi now
that the scanning is all done by site-scan.rb.

-- 

- Shane
  Director & Member
  The Apache Software Foundation

Re: [PROPOSAL] improvements to site-scan.rb/site.cgi

Posted by Sam Ruby <ru...@intertwingly.net>.
One other consideration: it probably continues to make sense for the
CGI to describe the check that is being made to the end user.  A
regular expression is barely adequate for that, but better than no
indication.
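
One way this could look, as a rough sketch only: each check carries an optional human-readable description, and the CGI falls back to showing the regex when none is given. The Check struct, its fields, and the description text are hypothetical and not part of the existing code.

```ruby
# Hypothetical check definition: pairs the validation regex with optional
# human-readable text the CGI could show instead of the raw regex.
Check = Struct.new(:validate, :description) do
  # Text for the end user: the description if present, else the regex source.
  def display_text
    description || "href must match #{validate.source}"
  end
end

license = Check.new(%r{^https?://.*apache.org/licenses/$},
                    'Link must point to the licenses page on apache.org')
events  = Check.new(%r{^https?://.*apache.org/events/current-event})

puts license.display_text  # shows the description
puts events.display_text   # no description given, so shows the regex
```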

- Sam Ruby


Re: [PROPOSAL] improvements to site-scan.rb/site.cgi

Posted by Shane Curcuru <as...@shanecurcuru.org>.
New architecture:

- lib/whimsy/sitestandards.rb defines hashes for all types of checks as
regexes.  In most (but not all) cases, site-scan.rb simply uses the
CHECK_CAPTURE regex to determine what a_href|a_text to capture (i.e. put
into site-scan.json), as well as various utility functions to ease
finding tlps vs. podlings.

The original design thought is:
* CHECK_CAPTURE is a lax/broad regex that would be used to define the
text|link we want to store.  This will capture some items that might not
strictly meet a requirement as spec'd, but are close.

* CHECK_TEXT|CHECK_VALIDATE (for links) would be used in the UI display
to more strictly validate whether a captured value is SITE_PASS or
SITE_WARN.  This is useful because we've later updated the capture
(like below for security) when it became clear that some related text
is actually good enough to pass, IMO.

- lib/whimsy/sitewebsite.rb defines 90% of the UI code to display data.

- www/site.cgi|pods.cgi are now just text output for descriptions and
calls to SiteWebsite to display the data, and should act exactly the
same.  The podling version also has a link to the podling status page.

- tools/site-scan.rb is commented and reorganized, with changes:

* most checks now use SiteStandards regexes (but not all)

* see the USAGE: minor update for improved functionality
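
As a rough illustration of the layering above (thin CGIs calling one shared display class), here is a sketch with a stand-in module. SharedDisplay, its render method, and the sample data are all hypothetical; this is not the real SiteWebsite API.

```ruby
# Hypothetical stand-in for the shared display code; it only illustrates
# two thin callers sharing one renderer.
module SharedDisplay
  # Render one text line per site; an optional block appends caller-specific text.
  def self.render(sites)
    sites.map do |name, checks|
      line = "#{name}: " + checks.map { |check, status| "#{check}=#{status}" }.join(', ')
      line += " #{yield(name)}" if block_given?
      line
    end
  end
end

# Sample data, invented for illustration.
tlps = { 'exampletlp' => { 'license' => 'SITE_PASS', 'security' => 'SITE_WARN' } }
pods = { 'examplepod' => { 'license' => 'SITE_PASS' } }

# A site.cgi-like caller: just the shared rendering.
puts SharedDisplay.render(tlps)

# A pods.cgi-like caller: same rendering plus a placeholder status-page link.
puts SharedDisplay.render(pods) { |name| "[status page for #{name}]" }
```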

Overall, this changes 18 websites to now SITE_PASS the 'security' check
instead of SITE_FAIL, primarily because it allows a_text of "Security
Reports" instead of just "Security", which I think is justified since
the purpose of the link is certainly clear to users.
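
To make the lax-capture / strict-validate split concrete, here is a small sketch for the 'security' text check. The two regexes and the classify method are assumptions for illustration; the real definitions live in lib/whimsy/sitestandards.rb and may differ. Returning nil for uncaptured text is also an assumption about how a missing value would be represented.

```ruby
# Hypothetical regexes for the 'security' check.
CAPTURE  = /security/i                 # lax: store anything security-related
VALIDATE = /\Asecurity( reports)?\z/i  # strict: link text accepted as passing

# Classify a link's text: nil if not captured, else SITE_PASS or SITE_WARN.
def classify(a_text)
  text = a_text.strip
  return nil unless text =~ CAPTURE
  text =~ VALIDATE ? 'SITE_PASS' : 'SITE_WARN'
end

['Security', 'Security Reports', 'Our security policy', 'Downloads'].each do |text|
  puts "#{text.inspect} => #{classify(text).inspect}"
end
```

Because only the strict regex decides pass vs. warn, loosening it (as with "Security Reports") changes results without touching what gets captured.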

-- 

- Shane
  Director & Member
  The Apache Software Foundation