Posted to dev@whimsical.apache.org by sebb <se...@gmail.com> on 2017/05/09 10:08:10 UTC

site scan algorithm and output data

The site scanner currently looks for specific links *or* specific text.

This does not always work well, e.g. httpd uses 'Sponsors' for the
'Thanks' link, so it appears to have no link rather than one with an
'incorrect' name.

I think it would be better to search for both the expected text and
the expected link, and record any matches for either.

Probably the search targets should also be recorded in the analysis output.
This should make it easier for the analysis to report what was expected.

for example:

httpd: {
   ...
   sponsorship: {
      text: {
        expected: "Thanks",
        found: ["http://.../"]
      },
      link: {
        expected: "http://...",
        found: ["Sponsors"]
      }
   }
}
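
A minimal sketch of the "search for both, record both" idea in Ruby, not the actual scanner code. It assumes the anchors have already been extracted from the page into `{text:, href:}` hashes; the helper name `check_links`, the example.org URLs, and the patterns are all hypothetical, chosen only to reproduce the httpd-style case where the link exists under a different name.

```ruby
# Hypothetical helper: `anchors` is a list of {text:, href:} hashes that a
# real scanner would extract from the page HTML.
def check_links(anchors, expected_text:, expected_link:)
  {
    text: {
      expected: expected_text.source,
      # hrefs of every anchor whose visible text matches the expected text
      found: anchors.select { |a| a[:text] =~ expected_text }.map { |a| a[:href] }
    },
    link: {
      expected: expected_link.source,
      # visible text of every anchor whose href matches the expected link
      found: anchors.select { |a| a[:href] =~ expected_link }.map { |a| a[:text] }
    }
  }
end

# Illustrative data only: the link is present but labelled 'Sponsors'.
anchors = [
  { text: 'Sponsors', href: 'https://example.org/thanks.html' },
  { text: 'License',  href: 'https://example.org/licenses/' }
]

result = check_links(
  anchors,
  expected_text: /\AThanks\z/i,
  expected_link: %r{example\.org/thanks}
)
```

With this data the text search finds nothing while the link search records 'Sponsors', so the analysis could report "link present, unexpected name" instead of "missing".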


Obviously this would mean changes to the analysis as well.

Thoughts?

Re: site scan algorithm and output data

Posted by Shane Curcuru <as...@shanecurcuru.org>.

sebb wrote on 5/9/17 6:08 AM:
> The site scanner currently looks for specific links *or* specific text.
> 
> This does not always work well, e.g. httpd uses 'Sponsors' for the
> 'Thanks' link, so it appears to have no link rather than one with an
> 'incorrect' name.
> 
> I think it would be better to search for both the expected text and
> the expected link, and record any matches for either.

Agreed.

Note that the analysis step is unlikely ever to be 100% accurate,
since the current policy is written with intent in mind, not as a
specific formula.  But you're right: looking for, and also storing scan
data for, both links and text is a great way to improve results.

Separately, I do think an "approved exceptions" list is, in some cases,
an easier way to improve results than funkier regexes or the like.  See
the concept in "Re: Rename site-check.rb => site-scan.rb?", improved
here to match your additions:

site-exceptions.json
{
  "axis": {
    "trademarks": { "allowed_string": "Trademark Registered of The ASF" },
    "events": { "allowed_url": "http://www.apache.org/special-event" }
  },
  ...
}
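
One way the analysis might consult such a list, as a rough Ruby sketch. It assumes the exceptions file is valid JSON with string keys, and that the scan hands over the texts and URLs it found; the method name `approved_exception?` and the exact-match rule are assumptions, not existing Whimsy code.

```ruby
require 'json'

# Hypothetical exceptions data, shaped like the site-exceptions.json idea.
exceptions = JSON.parse(<<~JSON)
  {
    "axis": {
      "trademarks": { "allowed_string": "Trademark Registered of The ASF" },
      "events": { "allowed_url": "http://www.apache.org/special-event" }
    }
  }
JSON

# Returns true when the scan found an approved alternative for this check.
# Matching is exact here; a real analysis might prefer substring matching.
def approved_exception?(exceptions, project, check, found_texts: [], found_urls: [])
  rule = exceptions.dig(project, check)
  return false unless rule

  (rule.key?('allowed_string') && found_texts.include?(rule['allowed_string'])) ||
    (rule.key?('allowed_url') && found_urls.include?(rule['allowed_url']))
end

axis_events_ok = approved_exception?(
  exceptions, 'axis', 'events',
  found_urls: ['http://www.apache.org/special-event']
)
```

A project or check with no entry in the list simply falls through to the normal analysis.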

> 
> Probably the search targets should also be recorded in the analysis output.
> This should make it easier for the analysis to report what was expected.
> 
> for example:
> 
> httpd: {
>    ...
>    sponsorship: {
>       text: {
>         expected: "Thanks",
>         found: ["http://.../"]
>       },
>       link: {
>         expected: "http://...",
>         found: ["Sponsors"]
>       },
>   }
> }
> 
> 
> Obviously this would mean changes to the analysis as well.
> 
> Thoughts?
> 


-- 

- Shane
  https://www.apache.org/foundation/marks/resources

Re: site scan algorithm and output data

Posted by Sam Ruby <ru...@intertwingly.net>.

On Tue, May 9, 2017 at 6:08 AM, sebb <se...@gmail.com> wrote:
> The site scanner currently looks for specific links *or* specific text.
>
> This does not always work well, e.g. httpd uses 'Sponsors' for the
> 'Thanks' link, so it appears to have no link rather than one with an
> 'incorrect' name.
>
> I think it would be better to search for both the expected text and
> the expected link, and record any matches for either.
>
> Probably the search targets should also be recorded in the analysis output.
> This should make it easier for the analysis to report what was expected.
>
> for example:
>
> httpd: {
>    ...
>    sponsorship: {
>       text: {
>         expected: "Thanks",
>         found: ["http://.../"]
>       },
>       link: {
>         expected: "http://...",
>         found: ["Sponsors"]
>       },
>   }
> }
>
>
> Obviously this would mean changes to the analysis as well.
>
> Thoughts?

If you would be willing to change both scripts to match... go for it!

- Sam Ruby