Posted to user@nutch.apache.org by Sammy Yu <sy...@brightedge.com> on 2011/08/06 12:11:16 UTC

Issue with erroneous URL

Hi,
   I'm using nutch-1.2 to do a single site specific crawl.  I'm noticing that with most of the crawls, Nutch is parsing out erroneous URLs that have text/javascript at the end.  I have taken out the parse-js plugin, thinking that it was the culprit, but it's still behaving the same way.
Some examples from a segment dump:

Recno:: 1451
URL:: http://www.frys.com/template/software/

ParseData::
Version: 5
Status: success(1,0)
Title: Fry's Electronics | Software
Outlinks: 42
  outlink: toUrl: http://www.frys.com/art/0900_site/header/tabs1000v06/ anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/template/software/ToPrHbCi3iWPcLxahibHcg__.node1 anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/template/software/text/javascript anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/template/software/frys.com anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/template/index/ anchor:  nofollow: false

Recno:: 29
URL:: http://www.frys.com/category/Outpost/Appliances/Fabric+Care/Irons
ParseData::
Version: 5
Status: success(1,0)
Title: Fry's Electronics | Fabric Care
Outlinks: 55
  outlink: toUrl: http://www.frys.com/art/0900_site/header/tabs1000v06/ anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/category/Outpost/Appliances/odaTikVABlUwfEfOiz1RyA__.node1 anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/category/Outpost/Appliances/text/javascript anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/category/Outpost/Appliances/frys.com anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/template/index/ anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/template/computerspc anchor:  nofollow: false

Also, paths referenced in JavaScript sections seem to be automatically parsed out as URLs:
<SCRIPT language="JavaScript"><!--
var gbl_ImgSvr = "http://images.frys.com";
var gbl_TabsImgPath = "/art/0900_site/header/tabs1000v06/";
Here /art/0900_site/header/tabs1000v06/ is expanded to http://www.frys.com/art/0900_site/header/tabs1000v06/ (the first outlink in the dumps above).
Is it possible to skip the <script> section of the parsed text?
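
For illustration, the outlinks in the first record are what you would get if bare strings such as "text/javascript" or "frys.com" were picked up as link candidates and resolved against the page URL. The minimal Python sketch below reproduces that resolution with the standard library. It is not Nutch code, and the candidate list is an assumption based on the dump, not the output of any actual Nutch extractor.

```python
from urllib.parse import urljoin

# Page URL taken from the first record (Recno 1451) in the segment dump.
base = "http://www.frys.com/template/software/"

# Strings a link heuristic might pick up from script text or attributes.
# This list is illustrative, inferred from the outlinks in the dump.
candidates = [
    "text/javascript",
    "frys.com",
    "/art/0900_site/header/tabs1000v06/",
]

# Relative candidates resolve against the page directory;
# a candidate with a leading slash resolves against the site root.
resolved = [urljoin(base, c) for c in candidates]

for candidate, url in zip(candidates, resolved):
    print(f"{candidate!r} -> {url}")
```

The resolved values match the toUrl entries listed for http://www.frys.com/template/software/, which suggests the bogus outlinks come from treating non-URL strings as relative links.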

Any help would be greatly appreciated.

Thanks,
Sammy


Re: Issue with erroneous URL

Posted by Julien Nioche <li...@gmail.com>.
Sammy,

Parse-js has been deactivated by default in 1.3. Do 'ant clean job' after
modifying nutch-site.xml just to be sure, then reparse the segments.

Julien
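
After reparsing, one way to check whether the bogus outlinks are gone is to scan a fresh segment dump for them. The sketch below is only a rough helper, assuming the dump text has the same "outlink: toUrl: ... anchor:" layout as the excerpt in the original message. The function name suspicious_outlinks and the suffix list are hypothetical and not part of Nutch.

```python
import re

# Matches lines shaped like the dump excerpt: "outlink: toUrl: <url> anchor: ..."
OUTLINK_RE = re.compile(r"^\s*outlink: toUrl: (\S+) anchor:")

def suspicious_outlinks(dump_text, suffixes=("/text/javascript",)):
    """Return outlink URLs in dump_text that end with any of the given suffixes.

    suffixes must be a tuple of strings (hypothetical default shown).
    """
    hits = []
    for line in dump_text.splitlines():
        match = OUTLINK_RE.match(line)
        if match and match.group(1).endswith(suffixes):
            hits.append(match.group(1))
    return hits

# Sample lines copied from the segment dump in the original message.
sample = """\
Outlinks: 42
  outlink: toUrl: http://www.frys.com/art/0900_site/header/tabs1000v06/ anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/template/software/text/javascript anchor:  nofollow: false
  outlink: toUrl: http://www.frys.com/template/index/ anchor:  nofollow: false
"""

print(suspicious_outlinks(sample))
```

An empty result on a dump taken after the reparse would indicate that the text/javascript outlinks are no longer being produced.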



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com