Posted to user@jspwiki.apache.org by Foster Schucker <Fo...@Schucker.org> on 2015/12/28 16:14:02 UTC

A way to find dead links for external pages

I have a wiki that ties into external sites.  As these places switch to 
new platforms, the old links die.  (Or they die due to refactoring.)

Anyway, I'm looking for a way to walk the wiki pages and see if there is 
a 200 response back from the other side.  I'm guessing one of you has 
had to do this before; no sense in reinventing the wheel.

In a perfect world it would spit out [PageName] URL ResponseCode for 
each URL (that would let me catch other errors like forbidden, etc.), 
but I'd be happy to get just the ones that don't produce a 200.

Thanks!
Foster
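
For reference, a minimal sketch of the kind of report described above, in
plain Java. The page-name/URL pairs are assumed to have been extracted from
the wiki already; all names here are illustrative, not part of any JSPWiki
API:

    import java.io.IOException;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Map;

    // Print "[PageName] URL ResponseCode" for each external link.
    public class LinkReport {
        public static void main(String[] args) {
            // Hypothetical input: wiki page name -> external URL.
            Map<String, String> links = Map.of(
                    "Main", "https://example.org/",
                    "OldVendorPage", "http://example.com/moved");
            for (Map.Entry<String, String> e : links.entrySet()) {
                try {
                    HttpURLConnection con =
                            (HttpURLConnection) new URL(e.getValue()).openConnection();
                    con.setRequestMethod("HEAD");          // only the status is needed
                    con.setInstanceFollowRedirects(false); // report 301/302 rather than follow
                    con.setConnectTimeout(5000);
                    con.setReadTimeout(5000);
                    System.out.println("[" + e.getKey() + "] "
                            + e.getValue() + " " + con.getResponseCode());
                } catch (IOException ex) {
                    // Dead host, refused connection, timeout, ...
                    System.out.println("[" + e.getKey() + "] "
                            + e.getValue() + " ERROR: " + ex.getMessage());
                }
            }
        }
    }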


Re: A way to find dead links for external pages

Posted by Paul Uszak <pa...@gmail.com>.
What security issues does this present?

Remember, the link targets have been selected by the wiki authors, so they
are desirable links rather than spam.  I can't quite see a threat model if
all you're doing is reporting an HTTP status code.  Isn't this what search
engines do by default?

On 4 January 2016 at 06:27, Derek Hohls <dh...@csir.co.za> wrote:

> There seem to be any number of tools out there... have you seen:
> https://wummel.github.io/linkchecker/
> (showing my bias towards Python)
>
> >>> Foster Schucker <Fo...@Schucker.org> 12/28/15 8:39 PM >>>
> Thanks. I didn't think there was a way in JSPWiki to do that. I was on
> the lookout for a tool that was smart enough to do the walk and only
> report back on external sites/links. I figured, with the number of people
> on this list who do that, I could get a quick recommendation of a tool
> that someone liked.
>
> Thanks!
>
> Foster
> On Mon, 28 Dec 2015 19:09:43 +0100, Harry Metske
> <ha...@gmail.com> wrote:
>
> There has been a discussion about this before:
> https://issues.apache.org/jira/browse/JSPWIKI-330
>
> We considered it a security risk and did not implement it.
>
> kind regards,
> Harry
>
>
> On 28 December 2015 at 16:14, Foster Schucker <Fo...@schucker.org>
> wrote:
>
> > I have a wiki that ties into external sites. As these places switch to
> > new platforms, the old links die. (Or they die due to refactoring.)
> >
> > Anyway, I'm looking for a way to walk the wiki pages and see if there
> > is a 200 response back from the other side. I'm guessing one of you has
> > had to do this before; no sense in reinventing the wheel.
> >
> > In a perfect world it would spit out [PageName] URL ResponseCode for
> > each URL (that would let me catch other errors like forbidden, etc.),
> > but I'd be happy to get just the ones that don't produce a 200.
> >
> > Thanks!
> > Foster

Re: A way to find dead links for external pages

Posted by Derek Hohls <dh...@csir.co.za>.
There seem to be any number of tools out there... have you seen:
https://wummel.github.io/linkchecker/
(showing my bias towards Python)
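
For what it's worth, a typical LinkChecker invocation for this job looks
like the following (the --check-extern flag is taken from the LinkChecker
documentation and should be verified against the installed version; the
wiki URL is a placeholder):

    linkchecker --check-extern https://mywiki.example.org/wiki/Main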



>>> Foster Schucker <Fo...@Schucker.org> 12/28/15 8:39 PM >>>
Thanks.  I didn't think there was a way in JSPWiki to do that.  I was on 
the lookout for a tool that was smart enough to do the walk and only 
report back on external sites/links.  I figured, with the number of 
people on this list who do that, I could get a quick recommendation of a 
tool that someone liked.

Thanks!

Foster
On Mon, 28 Dec 2015 19:09:43 +0100, Harry Metske 
<ha...@gmail.com> wrote:

    There has been a discussion about this before:
    https://issues.apache.org/jira/browse/JSPWIKI-330

    We considered it a security risk and did not implement it.

    kind regards,
    Harry


    On 28 December 2015 at 16:14, Foster Schucker <Fo...@schucker.org>
    wrote:

     > I have a wiki that ties into external sites. As these places switch
     > to new platforms, the old links die. (Or they die due to refactoring.)
     >
     > Anyway, I'm looking for a way to walk the wiki pages and see if there
     > is a 200 response back from the other side. I'm guessing one of you
     > has had to do this before; no sense in reinventing the wheel.
     >
     > In a perfect world it would spit out [PageName] URL ResponseCode for
     > each URL (that would let me catch other errors like forbidden, etc.),
     > but I'd be happy to get just the ones that don't produce a 200.
     >
     > Thanks!
     > Foster





Re: A way to find dead links for external pages

Posted by Foster Schucker <Fo...@Schucker.org>.
Thanks.  I didn't think there was a way in JSPWiki to do that.  I was on 
the lookout for a tool that was smart enough to do the walk and only 
report back on external sites/links.  I figured, with the number of 
people on this list who do that, I could get a quick recommendation of a 
tool that someone liked.

Thanks!

Foster
On Mon, 28 Dec 2015 19:09:43 +0100, Harry Metske 
<ha...@gmail.com> wrote:

    There has been a discussion about this before:
    https://issues.apache.org/jira/browse/JSPWIKI-330

    We considered it a security risk and did not implement it.

    kind regards,
    Harry


    On 28 December 2015 at 16:14, Foster Schucker <Fo...@schucker.org>
    wrote:

     > I have a wiki that ties into external sites. As these places switch
     > to new platforms, the old links die. (Or they die due to refactoring.)
     >
     > Anyway, I'm looking for a way to walk the wiki pages and see if there
     > is a 200 response back from the other side. I'm guessing one of you
     > has had to do this before; no sense in reinventing the wheel.
     >
     > In a perfect world it would spit out [PageName] URL ResponseCode for
     > each URL (that would let me catch other errors like forbidden, etc.),
     > but I'd be happy to get just the ones that don't produce a 200.
     >
     > Thanks!
     > Foster



RE: A way to find dead links for external pages

Posted by Jason Morris <ja...@sydney.edu.au>.
Hi All,

Just a thought…

I've had some success writing website testing utilities using a mashup of crawler4j (https://github.com/yasserg/crawler4j) and HtmlUnit (http://htmlunit.sourceforge.net/).

It's not too hard to hack something together to crawl a wiki and accumulate dead links.

Cheers,

Jason
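
A rough skeleton of the approach Jason describes, written from memory of
the crawler4j 4.x README; class and method names should be checked against
the version you actually pull in, and the wiki URL and storage folder are
placeholders:

    import java.util.Set;

    import edu.uci.ics.crawler4j.crawler.CrawlConfig;
    import edu.uci.ics.crawler4j.crawler.CrawlController;
    import edu.uci.ics.crawler4j.crawler.Page;
    import edu.uci.ics.crawler4j.crawler.WebCrawler;
    import edu.uci.ics.crawler4j.fetcher.PageFetcher;
    import edu.uci.ics.crawler4j.parser.HtmlParseData;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
    import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
    import edu.uci.ics.crawler4j.url.WebURL;

    // Crawl only the wiki itself and print the external links each page
    // carries; the printed URLs can then be HEAD-checked separately.
    public class WikiLinkCrawler extends WebCrawler {
        private static final String WIKI_PREFIX = "https://mywiki.example.org/";

        @Override
        public boolean shouldVisit(Page referringPage, WebURL url) {
            return url.getURL().startsWith(WIKI_PREFIX);   // stay inside the wiki
        }

        @Override
        public void visit(Page page) {
            if (page.getParseData() instanceof HtmlParseData) {
                Set<WebURL> outgoing =
                        ((HtmlParseData) page.getParseData()).getOutgoingUrls();
                for (WebURL link : outgoing) {
                    if (!link.getURL().startsWith(WIKI_PREFIX)) {
                        System.out.println(page.getWebURL().getURL()
                                + " -> " + link.getURL());
                    }
                }
            }
        }

        public static void main(String[] args) throws Exception {
            CrawlConfig config = new CrawlConfig();
            config.setCrawlStorageFolder("/tmp/wiki-crawl");   // placeholder
            config.setPolitenessDelay(500);                    // be gentle to the server
            PageFetcher fetcher = new PageFetcher(config);
            RobotstxtServer robots =
                    new RobotstxtServer(new RobotstxtConfig(), fetcher);
            CrawlController controller = new CrawlController(config, fetcher, robots);
            controller.addSeed(WIKI_PREFIX);
            controller.start(WikiLinkCrawler.class, 2);        // 2 crawler threads
        }
    }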



-----Original Message-----
From: Juan Pablo Santos Rodríguez [mailto:juanpablo.santos@gmail.com]
Sent: Wednesday, 30 December 2015 4:15 AM
To: user@jspwiki.apache.org
Subject: Re: A way to find dead links for external pages



Hi,

we bundle, as an example, not intended for production use, a
PingWeblogsComFilter [#1], which pings weblogs.com on each page save (a
much older, similar approach on [#2]). A plugin performing similar
functionality could easily be made and placed on a protected wiki page
or, better, made to perform the ping only for a given set of users /
groups.

As for protecting against the changing URLs of external sites, you could
define some interwiki links [#3].

HTH,
juan pablo

[#1]:
http://jspwiki.apache.org/apidocs/2.10.1/org/apache/wiki/filters/PingWeblogsComFilter.html
[#2]: http://www.ecyrd.com/JSPWiki/wiki/WeblogsPing
[#3]: https://jspwiki-wiki.apache.org/Wiki.jsp?page=InterWiki

On Tue, Dec 29, 2015 at 2:27 PM, Adrien Beau <ad...@gmail.com> wrote:

> On Mon, Dec 28, 2015 at 7:09 PM, Harry Metske <ha...@gmail.com>
>  wrote:
> >
> > We considered it a security risk and did not implement it.
>
> Having a server go blindly into user-specified URLs is indeed a huge
> security risk. Users could easily create a denial of service (listing
> hundreds of URLs) either for the target or the JSPWiki server itself.
> They could also use the feature to exploit vulnerable URLs, disguising
> themselves as the JSPWiki server.
>
> However, I believe safer, more limited approaches could be used that
> would still provide value to site administrators (from least to most
> dangerous, from least to most value to the administrator):
>
> - Collate all host names mentioned in wiki pages; run one DNS query
> per host name (using rate limits); take note of which host names no
> longer exist; report pages that contain links to those hosts
> - Similar idea, but run one HEAD HTTP request to the root (/) of each
> host name in addition to resolving the name
> - Similar idea, up to the path component of the URL; canonicalize it,
> apply a size limit, remove queries and fragments; this should still be
> rather safe
>
> (Note that these are only ideas. I am not volunteering to implement
> them.)
>
> --
> Adrien
>

Re: A way to find dead links for external pages

Posted by Juan Pablo Santos Rodríguez <ju...@gmail.com>.
Hi,

we bundle, as an example, not intended for production use, a
PingWeblogsComFilter [#1], which pings weblogs.com on each page save (a
much older, similar approach on [#2]). A plugin performing similar
functionality could easily be made and placed on a protected wiki page
or, better, made to perform the ping only for a given set of users /
groups.

As for protecting against the changing URLs of external sites, you could
define some interwiki links [#3].


HTH,
juan pablo


[#1]:
http://jspwiki.apache.org/apidocs/2.10.1/org/apache/wiki/filters/PingWeblogsComFilter.html
[#2]: http://www.ecyrd.com/JSPWiki/wiki/WeblogsPing
[#3]: https://jspwiki-wiki.apache.org/Wiki.jsp?page=InterWiki
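
For the plugin route Juan Pablo mentions, a minimal sketch against what I
believe is the 2.10 plugin interface (org.apache.wiki.api.plugin.WikiPlugin);
the class name and the "url" parameter are invented for the example, and
the plugin would still need the access restrictions he describes:

    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.util.Map;

    import org.apache.wiki.WikiContext;
    import org.apache.wiki.api.exceptions.PluginException;
    import org.apache.wiki.api.plugin.WikiPlugin;

    // Hypothetical plugin: HEAD-checks a single URL given as a parameter
    // and renders its HTTP status into the page.
    public class LinkStatusPlugin implements WikiPlugin {
        @Override
        public String execute(WikiContext context, Map<String, String> params)
                throws PluginException {
            String target = params.get("url");
            if (target == null) {
                throw new PluginException("Parameter 'url' is required");
            }
            try {
                HttpURLConnection con =
                        (HttpURLConnection) new URL(target).openConnection();
                con.setRequestMethod("HEAD");
                con.setConnectTimeout(5000);
                con.setReadTimeout(5000);
                return target + " -> " + con.getResponseCode();
            } catch (Exception e) {
                return target + " -> ERROR: " + e.getMessage();
            }
        }
    }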

On Tue, Dec 29, 2015 at 2:27 PM, Adrien Beau <ad...@gmail.com> wrote:

> On Mon, Dec 28, 2015 at 7:09 PM, Harry Metske <ha...@gmail.com>
>  wrote:
> >
> > We considered it a security risk and did not implement it.
>
> Having a server go blindly into user-specified URLs is indeed a huge
> security risk. Users could easily create a denial of service (listing
> hundreds of URLs) either for the target or the JSPWiki server itself. They
> could also use the feature to exploit vulnerable URLs, disguising
> themselves as the JSPWiki server.
>
> However, I believe safer, more limited approaches could be used that would
> still provide value to site administrators (from least to most dangerous,
> from least to most value to the administrator):
>
> - Collate all host names mentioned in wiki pages; run one DNS query per
> host name (using rate limits); take note of which host names no longer
> exist; report pages that contain links to those hosts
> - Similar idea, but run one HEAD HTTP request to the root (/) of each host
> name in addition to resolving the name
> - Similar idea, up to the path component of the URL; canonicalize it, apply
> a size limit, remove queries and fragments; this should still be rather
> safe
>
> (Note that these are only ideas. I am not volunteering to implement them.)
>
> --
> Adrien
>

Re: A way to find dead links for external pages

Posted by Adrien Beau <ad...@gmail.com>.
On Mon, Dec 28, 2015 at 7:09 PM, Harry Metske <ha...@gmail.com>
 wrote:
>
> We considered it a security risk and did not implement it.

Having a server go blindly into user-specified URLs is indeed a huge
security risk. Users could easily create a denial of service (listing
hundreds of URLs) either for the target or the JSPWiki server itself. They
could also use the feature to exploit vulnerable URLs, disguising
themselves as the JSPWiki server.

However, I believe safer, more limited approaches could be used that would
still provide value to site administrators (from least to most dangerous,
from least to most value to the administrator):

- Collate all host names mentioned in wiki pages; run one DNS query per
host name (using rate limits); take note of which host names no longer
exist; report pages that contain links to those hosts
- Similar idea, but run one HEAD HTTP request to the root (/) of each host
name in addition to resolving the name
- Similar idea, up to the path component of the URL; canonicalize it, apply
a size limit, remove queries and fragments; this should still be rather safe

(Note that these are only ideas. I am not volunteering to implement them.)

-- 
Adrien
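
A sketch of the first (DNS-only) idea from the list above, assuming the
host names have already been collated from the wiki pages; the host list
and the delay value are illustrative:

    import java.net.InetAddress;
    import java.net.UnknownHostException;
    import java.util.List;

    // One rate-limited DNS lookup per host; report those that no longer resolve.
    public class DeadHostReport {
        public static void main(String[] args) throws InterruptedException {
            // Hypothetical input: host names collated from wiki pages.
            List<String> hosts = List.of("example.org", "no-such-host.invalid");
            for (String host : hosts) {
                try {
                    InetAddress.getByName(host);      // a single DNS lookup
                } catch (UnknownHostException e) {
                    System.out.println("DEAD HOST: " + host);
                }
                Thread.sleep(500);                    // crude rate limit
            }
        }
    }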

Re: A way to find dead links for external pages

Posted by Harry Metske <ha...@gmail.com>.
There has been a discussion about this before:
https://issues.apache.org/jira/browse/JSPWIKI-330

We considered it a security risk and did not implement it.

kind regards,
Harry


On 28 December 2015 at 16:14, Foster Schucker <Fo...@schucker.org> wrote:

> I have a wiki that ties into external sites.  As these places switch to
> new platforms, the old links die.  (Or they die due to refactoring.)
>
> Anyway, I'm looking for a way to walk the wiki pages and see if there is a
> 200 response back from the other side.  I'm guessing one of you has had to
> do this before; no sense in reinventing the wheel.
>
> In a perfect world it would spit out [PageName] URL ResponseCode for each
> URL (that would let me catch other errors like forbidden, etc.), but I'd be
> happy to get just the ones that don't produce a 200.
>
> Thanks!
> Foster
>
>