Posted to user@nutch.apache.org by Roberto Gardenier <r....@simgroep.nl> on 2012/05/01 13:25:25 UTC

Crawl sites with hashtags in url

Hello,

 

I'm currently trying to crawl a site which uses hashtags in its URLs. I don't
seem to get any results, and I'm hoping I'm just overlooking something.

I have created a JIRA bug report because I was not aware of the existence of
this mailing list. It's my first time using such channels, so I hope I'm
sending this message correctly.

Link: https://issues.apache.org/jira/browse/NUTCH-1343

 

The site structure that I'm trying to index is as follows:

http://domain.com (landing page)

http://domain.com/#/page1

http://domain.com/#/page1/subpage1

http://domain.com/#/page2

http://domain.com/#/page2/subpage1

and so on.

 

I've pointed Nutch at http://domain.com as the start URL, and in my filter
I've placed all kinds of rules.

First I thought this would be sufficient:

+http\://domain\.com\/#

But then I realised that # is used for comments, so I escaped it:

+http\://domain\.com\/\#

 

Still no results. So I thought I could use an asterisk:

+http\://domain\.com\/*

Still no luck. So I started trying various other regexes, but without success.
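
For reference, a minimal regex-urlfilter.txt sketch that would accept the
whole domain, fragment URLs included (a sketch, assuming the stock filter
syntax, where a line is a comment only if it begins with '#' and every rule
starts with + or -):

  # accept everything under domain.com, fragments and all
  +^http://domain\.com/
  # reject everything else
  -.

As the replies below explain, though, the URL normalizers strip the fragment
before the filter is ever consulted, so no filter rule on its own can keep
these URLs alive.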

 

I noticed the following message in hadoop.log:

INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off

I've researched this setting, but I don't know for sure whether it affects my
problem. The property is set to false in my configs.

 

I don't know whether this is even related to the situation above, but maybe
it helps.

 

Any help is very much appreciated! I've tried Googling the problem, but I
couldn't find any documentation or anyone else with the same problem.

 

Many thanks in advance.

 

With kind regards,

Roberto Gardenier

 


RE: Crawl sites with hashtags in url

Posted by Roberto Gardenier <r....@simgroep.nl>.
Hi Sebastian, 

I have looked at the RFC, and I'm convinced that I don't need to take any further action on this issue: the website is simply not following the rules. Just like Twitter... but who cares.
It's not our problem anymore. Thank you so much for your reply!

Kind regards,
Roberto Gardenier 
    



Re: Crawl sites with hashtags in url

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Roberto,

As defined in RFC 3986 (ftp://ftp.rfc-editor.org/in-notes/rfc3986.txt), the
hash ('#') separates the "fragment" from the rest of the URL.
The RFC explicitly delegates the semantics of the fragment to the media
type of the document. In good old HTML the fragment is just an "anchor"
and should be removed - otherwise the same physical document would be
fetched multiple times under different URLs. That's the current behavior
of Nutch; see Markus' explanation.
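
Schematically, a URI per RFC 3986 is

  scheme://authority/path?query#fragment

so in http://domain.com/#/page1 the path is just "/" and "/page1" is the
fragment. The fragment is never even sent to the server in an HTTP request,
which is why, to a crawler, all of these URLs name the same resource:
http://domain.com/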

Nowadays (with AJAX) the situation is changing: fragments are used to
address not just a different view but genuinely different content. Have a
look at NUTCH-1323 and Markus' comment on NUTCH-1339; maybe this will help
you solve the problem.
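
For background: the AJAX-crawling convention of that era (Google's
"_escaped_fragment_" scheme) only covers hash-bang URLs, mapping them to a
crawlable form roughly like

  http://domain.com/#!/page1  ->  http://domain.com/?_escaped_fragment_=/page1

so the server can answer with a static snapshot. The site above uses a plain
"#/" rather than "#!", so it is not covered by that convention either.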

Sebastian



Re: Crawl sites with hashtags in url

Posted by Markus Jelsma <ma...@openindex.io>.
Hi,

URLs are passed through a series of normalizers. By default both the
RegexNormalizer and the BasicNormalizer affect URLs with anchors; the latter
removes the anchor completely and is not configurable.
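
The effect, sketched on the URLs above (illustrative, not actual log output):

  input:                           http://domain.com/#/page1
  after the basic URL normalizer:  http://domain.com/

Every fragment URL collapses back into the landing page, which is why the
crawl appears to find nothing new.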

You can either hack your way around this by simply disabling the removal of
the page reference, or make it configurable. In that case you're welcome to
attach a patch to a new issue in Jira.
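
A quick way to try the first option is to leave the basic normalizer out of
the plugin list in nutch-site.xml - a sketch, assuming a default
plugin.includes along these lines:

  <property>
    <name>plugin.includes</name>
    <!-- 'basic' dropped from the usual urlnormalizer-(pass|regex|basic) group -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex)</value>
  </property>

The default regex-normalize.xml also ships a rule that strips anchors, so
that rule would need to be removed or adjusted as well.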

Cheers,



-- 
Markus Jelsma - CTO - Openindex