Posted to user@nutch.apache.org by Ye T Thet <ye...@gmail.com> on 2012/08/26 18:06:13 UTC

Extracting non anchored URLs from page

Hi Folks,

I am using Nutch (1.2 and 1.5) to crawl some websites.

The short question is: is there any way, or a plug-in, to extract URLs
that are not inside anchor tags on a page?

The long question:

The crawler is not extracting some of the URLs from the page. After
investigating, I noticed that the URLs are technically not links, i.e. they
are not inside anchor elements. The URLs sit in attribute values of other
HTML tags and are consumed by JavaScript.

Following is a snippet of the content.

<div class='widget-content'>
<h2 class="sidebar-title">
<form action="../" name="bloglinkform">
<select onchange="this.form.window_namer.value++;if
(this.options[this.selectedIndex].value!='MORE')
{window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
name="bloglinkselect">
<option selected="selected" value="MORE"/>text 1
<option value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text
2
<option value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text
3
</select>
<input value="1" name="window_namer" type="hidden"/>
</form></h2>
</div>

As mentioned above, the URLs are not in HTML anchor tags, but they are valid
URLs used by JavaScript when the user selects an item. As a result, those
addresses are not crawled. To make matters worse, there is no site map or
index page from which such URLs can be reached other than through the
above-mentioned widget.
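
For illustration, one way to recover such URLs would be to walk the parsed
DOM and regex-scan every attribute value. A minimal sketch in plain Java
(the class, method names, and regex are my own, not part of Nutch):

    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.w3c.dom.NamedNodeMap;
    import org.w3c.dom.Node;

    public class AttributeUrlExtractor {
        private static final Pattern URL =
            Pattern.compile("https?://[^\\s\"'<>]+");

        // Collect http(s) URLs appearing in any attribute value under node,
        // e.g. the option/@value URLs in the snippet above.
        public static Set<String> extract(Node node) {
            Set<String> urls = new HashSet<String>();
            walk(node, urls);
            return urls;
        }

        private static void walk(Node node, Set<String> urls) {
            NamedNodeMap attrs = node.getAttributes();  // null for non-elements
            if (attrs != null) {
                for (int i = 0; i < attrs.getLength(); i++) {
                    Matcher m = URL.matcher(attrs.item(i).getNodeValue());
                    while (m.find()) {
                        urls.add(m.group());
                    }
                }
            }
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                walk(c, urls);  // recurse into children
            }
        }
    }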

Has anyone encountered such a case and figured out a solution? Any tips
or direction would be great.

Thanks,

Ye

Re: Extracting non anchored URLs from page

Posted by Ye T Thet <ye...@gmail.com>.
Markus, Shaya,

Thanks for the responses. I was hoping this scenario had been brought up
before and that there was a ready-made solution.

As for parsers, I tried both Nutch's HTML parser plugin and the Tika plugin.

I ran ./bin/nutch org.apache.nutch.parse.ParserChecker
http://crawledsite.blogspot.com/ > parserresult.txt to check the outlinks
from a page. Neither parser yielded the non-anchored URLs.

To cater for my scenario, should I be looking into OutlinkExtractor, or
into the parser plug-ins (html, tika), to get the outlinks?

Background: I am pretty much a Nutch user who writes a plugin here and there
and tweaks configs to get things done. I do not have an in-depth understanding
of the Nutch code base, but I am open to digging further to get this working for me.

Thanks,

Ye



On Mon, Aug 27, 2012 at 12:25 AM, Markus Jelsma <ma...@openindex.io> wrote:

> Nutch's parser relies on Nutch's OutlinkExtractor if the underlying parser
> did not yield any outlinks.

RE: Extracting non anchored URLs from page

Posted by Markus Jelsma <ma...@openindex.io>.
Nutch's parser relies on Nutch's OutlinkExtractor if the underlying parser did not yield any outlinks.
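
(For illustration: OutlinkExtractor can also be invoked directly on raw text.
A rough sketch, assuming the static getOutlinks(String, Configuration) method
of org.apache.nutch.parse.OutlinkExtractor in Nutch 1.x; verify the signature
against your version.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.parse.Outlink;
    import org.apache.nutch.parse.OutlinkExtractor;
    import org.apache.nutch.util.NutchConfiguration;

    public class RawOutlinkCheck {
        public static void main(String[] args) throws Exception {
            Configuration conf = NutchConfiguration.create();
            // OutlinkExtractor regex-scans plain text for URLs, so it can see
            // URLs in attribute values that DOM-based anchor extraction skips.
            String html = "<option value=\"http://example.com/2007/post.html\"/>text";
            Outlink[] links = OutlinkExtractor.getOutlinks(html, conf);
            for (Outlink link : links) {
                System.out.println(link.getToUrl());
            }
        }
    }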
 

Re: Extracting non anchored URLs from page

Posted by Shaya Potter <sp...@gmail.com>.
Finding URLs in plain text is hard:

http://www.codinghorror.com/blog/2008/10/the-problem-with-urls.html

I'm dealing with plain-text emails (so people might also try to offset URLs
with () or _).

What I do, based on Jeff Atwood's post:

    import java.util.HashSet;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    static public HashSet<String> urlExtractor(String text) {
        HashSet<String> results = new HashSet<String>();

        // Regex from the post above, extended with an optional leading '('
        // or '_' so offset URLs like "(http://...)" are still matched.
        Pattern pattern = Pattern.compile(
            "[(_]?http://[-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]");

        Matcher matcher = pattern.matcher(text);

        while (matcher.find()) {
            String url = matcher.group();

            // Strip the offset characters the regex allowed in.
            if (url.startsWith("(") || url.startsWith("_")) {
                if (url.endsWith(")") || url.endsWith("_")) {
                    url = url.substring(1, url.length() - 1);
                } else {
                    url = url.substring(1);
                }
            }

            results.add(url);
        }

        return results;
    }

Also, when processing the URLs: if I get a 404, I check whether the URL ends
with ')'. For example, "(for instance, check out http://my.site.com)" would
yield a URL with a bad ')' at the end.
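
That 404 fallback could be as simple as this (a hypothetical helper, not
from any library):

    // If a fetch 404'd and the URL ends with ')', return a candidate URL
    // without the trailing paren to retry; otherwise return null (no retry).
    static String stripBadParen(String url, int httpStatus) {
        if (httpStatus == 404 && url.endsWith(")")) {
            return url.substring(0, url.length() - 1);
        }
        return null;
    }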

What you might want to do, then, is throw all outbound links into a set, and
then do a pass like this over the document, throwing all found links into the
same set, e.g.:
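
    // Sketch (names hypothetical): union the parser's anchor-based outlinks
    // with the regex pass over the raw document text.
    Set<String> allLinks = new HashSet<String>(anchorOutlinks);
    allLinks.addAll(urlExtractor(documentText));  // urlExtractor from above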
