Posted to user@nutch.apache.org by Sammy Yu <sy...@brightedge.com> on 2011/08/06 12:11:16 UTC
Issue with erroneous URL
Hi,
I'm using nutch-1.2 to do a single site specific crawl. I'm noticing that with most of the crawls, Nutch is parsing out erroneous URLs that end in text/javascript. I have taken out the parse-js plugin, thinking that it was the culprit, but it still behaves the same way.
Some examples from a segment dump:
Recno:: 1451
URL:: http://www.frys.com/template/software/
ParseData::
Version: 5
Status: success(1,0)
Title: Fry's Electronics | Software
Outlinks: 42
outlink: toUrl: http://www.frys.com/art/0900_site/header/tabs1000v06/ anchor: nofollow: false
outlink: toUrl: http://www.frys.com/template/software/ToPrHbCi3iWPcLxahibHcg__.node1 anchor: nofollow: false
outlink: toUrl: http://www.frys.com/template/software/text/javascript anchor: nofollow: false
outlink: toUrl: http://www.frys.com/template/software/frys.com anchor: nofollow: false
outlink: toUrl: http://www.frys.com/template/index/ anchor: nofollow: false
Recno:: 29
URL:: http://www.frys.com/category/Outpost/Appliances/Fabric+Care/Irons
ParseData::
Version: 5
Status: success(1,0)
Title: Fry's Electronics | Fabric Care
Outlinks: 55
outlink: toUrl: http://www.frys.com/art/0900_site/header/tabs1000v06/ anchor: nofollow: false
outlink: toUrl: http://www.frys.com/category/Outpost/Appliances/odaTikVABlUwfEfOiz1RyA__.node1 anchor: nofollow: false
outlink: toUrl: http://www.frys.com/category/Outpost/Appliances/text/javascript anchor: nofollow: false
outlink: toUrl: http://www.frys.com/category/Outpost/Appliances/frys.com anchor: nofollow: false
outlink: toUrl: http://www.frys.com/template/index/ anchor: nofollow: false
outlink: toUrl: http://www.frys.com/template/computerspc anchor: nofollow: false
Also, paths referenced in JavaScript sections seem to be automatically parsed out as URLs.
<SCRIPT language="JavaScript"><!--
var gbl_ImgSvr = "http://images.frys.com";
var gbl_TabsImgPath = "/art/0900_site/header/tabs1000v06/";
Here /art/0900_site/header/tabs1000v06/ is expanded to http://www.frys.com/art/0900_site/header/tabs1000v06/.
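To illustrate how outlinks like these could arise, here is a minimal sketch using plain java.net.URI. It assumes the link extractor treats any path-like string it finds in the page as a link and resolves it against the page URL. That is only an assumption about the cause; the class and method names (ResolveDemo, resolve) are made up for illustration and are not Nutch code.

```java
import java.net.URI;

public class ResolveDemo {
    // Resolve a candidate string against the page URL using standard
    // relative-reference resolution. Hypothetical helper, not part of Nutch.
    static String resolve(String pageUrl, String candidate) {
        return URI.create(pageUrl).resolve(candidate).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.frys.com/template/software/";

        // A MIME type string mistaken for a relative path
        System.out.println(resolve(page, "text/javascript"));

        // A bare host name mistaken for a relative path
        System.out.println(resolve(page, "frys.com"));

        // An absolute path taken from a JavaScript string literal
        System.out.println(resolve(page, "/art/0900_site/header/tabs1000v06/"));
    }
}
```

For the first page in the dump, this prints the same three URLs that show up as outlinks, which suggests the strings are being picked up from the script text and attributes rather than from real anchors.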
Is it possible to skip the <script> section of the parsed text?
Any help would be greatly appreciated.
Thanks,
Sammy
Re: Issue with erroneous URL
Posted by Julien Nioche <li...@gmail.com>.
Sammy,
Parse-js has been deactivated by default in 1.3. Do 'ant clean job' after
modifying nutch-site.xml just to be sure, then reparse the segments.
Julien
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com