Posted to user@nutch.apache.org by Ye T Thet <ye...@gmail.com> on 2012/08/27 17:27:34 UTC

Parsing Outlinks from plain text for HTML documents

Hi Folks,

I am seeking opinions on parsing outlinks as plain text from HTML
documents. The reason is that some of the sites I am trying to crawl use
JavaScript for outlink URLs. For example:

<form action="../" name="bloglinkform">
<select onchange="this.form.window_namer.value++;if (this.options[this.selectedIndex].value!='MORE') {window.open(this.options[this.selectedIndex].value,'WinName'+this.form.window_namer.value,'toolbar=1,location=1,directories=1,status=1,menubar=1,scrollbars=1,resizable=2')}"
name="bloglinkselect">
<option selected="selected" value="MORE"/>text 1
<option value="http://craweledsite.blogspot.com/2007/11/blog-post_7360.html"/>text 2
<option value="http://craweledsite.blogspot.com/2007/09/blog-post_10.html"/>text 3
</select>
<input value="1" name="window_namer" type="hidden"/>
</form>

Since it is an HTML document, the HTML parser is used to extract the
outlinks from the page, based on my nutch-site.xml config. As a result,
the URLs in those form tags are not extracted as outlinks. After digging
into the Nutch source code, I discovered that OutlinkExtractor is used by
the plain text parsers to extract URLs from plain text. Apparently I could
modify the HTML parser to extract outlinks using regex as well.
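
To make the idea concrete, here is a rough sketch of the regex approach
using only java.util.regex. The class name RegexOutlinkSketch and the
method extractUrls are hypothetical names for illustration, not Nutch code,
and the pattern is an assumption on my part: it only catches absolute
http(s) URLs and would miss relative links.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Nutch): scans raw HTML as plain text
// and collects anything that looks like an absolute http(s) URL.
final class RegexOutlinkSketch {

    // Assumed pattern: a URL runs until whitespace, a quote, or an angle bracket.
    private static final Pattern URL_PATTERN =
            Pattern.compile("https?://[^\\s\"'<>]+", Pattern.CASE_INSENSITIVE);

    static List<String> extractUrls(String rawHtml) {
        // LinkedHashSet drops duplicates while keeping first-seen order.
        Set<String> urls = new LinkedHashSet<>();
        Matcher matcher = URL_PATTERN.matcher(rawHtml);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return new ArrayList<>(urls);
    }

    public static void main(String[] args) {
        // Shortened version of the form markup from the example above.
        String html = "<select name=\"bloglinkselect\">"
                + "<option selected=\"selected\" value=\"MORE\"/>text 1"
                + "<option value=\"http://craweledsite.blogspot.com/2007/11/blog-post_7360.html\"/>text 2"
                + "<option value=\"http://craweledsite.blogspot.com/2007/09/blog-post_10.html\"/>text 3"
                + "</select>";
        for (String url : extractUrls(html)) {
            System.out.println(url);
        }
    }
}
```

Each extracted string would still need to be turned into an outlink and
passed through the usual URL filters and normalizers, which this sketch
does not attempt.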

The question is: would it be feasible or advisable, in terms of resource
efficiency, to extract outlinks from HTML as plain text using regex?

Does anyone know a better approach for this scenario?

Thanks,

Ye