Posted to user@nutch.apache.org by Per Andreas Buer <pe...@linpro.no> on 2008/01/26 09:11:22 UTC

crawler fetching both http://foo/bar#quux and http://foo/bar#zoo

Hi.

I'm indexing an intranet and I see that some pages are fetched twenty
times. A lot of anchors are used, so there are many links like the
ones in the subject.

Is there some way I can instruct the crawler to discard the part of the
URL after the hash sign? I'm running Nutch from trunk, from a checkout
a few months old.

TIA,


Per.

Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo

Posted by Per Andreas Buer <pe...@linpro.no>.
Excellent.

Thank you.

Per.


Marcin Okraszewski wrote:
> There is a regex-normalize.xml in the conf dir, which allows you to manipulate URLs (e.g. remove the string after '#'). Remember to have urlnormalizer-regex in the plugin.includes option (nutch-site.xml).
> 
> Marcin
> 
> 
> On 26 January 2008 at 9:36, Prafulla <pr...@gmail.com> wrote:
> 
>> Hi,
>>
>> The crawl-urlfilter.txt file in the conf directory can be used to provide
>> regular expressions that control which URLs are crawled. However, this
>> will only help you ignore URLs containing '#'. I don't think you can ask
>> the crawler to ignore just the part of the URL after the hash sign by
>> configuring properties; you may have to write some code to achieve that.
>>
>> Regards,
>> Prafulla
>>
>> On Jan 26, 2008 1:41 PM, Per Andreas Buer  wrote:
>>
>>> Hi.
>>>
>>> I'm indexing an intranet and I see some pages are fetched twenty times.
>>> There are a lot of anchors used so there are a lot of links like the
>>> ones in the subject.
>>>
>>> Is there some way I can instruct the crawler to discard the part of the
>>> url which is after the hash sign? I'm using nutch from trunk a few
>>> months back in time.
>>>
>>> TIA,
>>>
>>>
>>> Per.
>>>


Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo

Posted by Marcin Okraszewski <ok...@o2.pl>.
There is a regex-normalize.xml in the conf dir, which allows you to manipulate URLs (e.g. remove the string after '#'). Remember to have urlnormalizer-regex in the plugin.includes option (nutch-site.xml).
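
For reference, a minimal nutch-site.xml fragment enabling the regex normalizer might look like the following (the plugin list shown here is only illustrative; keep whatever other plugins your installation already uses and just make sure urlnormalizer-regex is among them):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|urlnormalizer-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  <description>Include urlnormalizer-regex so that the rules in
  regex-normalize.xml are applied to URLs.</description>
</property>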

Marcin


On 26 January 2008 at 9:36, Prafulla <pr...@gmail.com> wrote:

> Hi,
> 
> The crawl-urlfilter.txt file in the conf directory can be used to provide
> regular expressions that control which URLs are crawled. However, this
> will only help you ignore URLs containing '#'. I don't think you can ask
> the crawler to ignore just the part of the URL after the hash sign by
> configuring properties; you may have to write some code to achieve that.
> 
> Regards,
> Prafulla
> 
> On Jan 26, 2008 1:41 PM, Per Andreas Buer  wrote:
> 
> > Hi.
> >
> > I'm indexing an intranet and I see some pages are fetched twenty times.
> > There are a lot of anchors used so there are a lot of links like the
> > ones in the subject.
> >
> > Is there some way I can instruct the crawler to discard the part of the
> > url which is after the hash sign? I'm using nutch from trunk a few
> > months back in time.
> >
> > TIA,
> >
> >
> > Per.
> >
> 

Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo

Posted by Prafulla <pr...@gmail.com>.
Hi,

The crawl-urlfilter.txt file in the conf directory can be used to provide
regular expressions that control which URLs are crawled. However, this will
only help you ignore URLs containing '#'. I don't think you can ask the
crawler to ignore just the part of the URL after the hash sign by
configuring properties; you may have to write some code to achieve that.
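
As a sketch of that filter approach (only an illustration; note that dropping
the URL drops the whole page, not just the fragment), a crawl-urlfilter.txt
rule to skip any URL containing '#' could look like:

# skip any URL containing a '#'
-[#]
# accept everything else
+.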

Regards,
Prafulla

On Jan 26, 2008 1:41 PM, Per Andreas Buer <pe...@linpro.no> wrote:

> Hi.
>
> I'm indexing an intranet and I see some pages are fetched twenty times.
> There are a lot of anchors used so there are a lot of links like the
> ones in the subject.
>
> Is there some way I can instruct the crawler to discard the part of the
> url which is after the hash sign? I'm using nutch from trunk a few
> months back in time.
>
> TIA,
>
>
> Per.
>

Re: crawler fetching both http://foo/bar#quux and http://foo/bar#zoo

Posted by Siddhartha Reddy <si...@grok.in>.
Adding this to your conf/regex-normalize.xml should remove the anchor from
the URLs:

<regex>
  <pattern>\#(.*)</pattern>
  <substitution></substitution>
</regex>
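
Under the hood that rule is just a regular-expression replacement. A quick
standalone sketch of what the pattern does (the class and method names here
are made up for illustration, not part of Nutch):

```java
import java.util.regex.Pattern;

public class StripFragment {
    // Same idea as the regex-normalize.xml rule above:
    // match '#' and everything after it, replace with nothing.
    private static final Pattern FRAGMENT = Pattern.compile("#(.*)");

    static String normalize(String url) {
        return FRAGMENT.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Both URLs from the subject collapse to the same normalized form,
        // so the crawler would fetch the page only once.
        System.out.println(normalize("http://foo/bar#quux")); // http://foo/bar
        System.out.println(normalize("http://foo/bar#zoo"));  // http://foo/bar
    }
}
```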

Regards,
Siddhartha

On Jan 26, 2008 1:41 PM, Per Andreas Buer <pe...@linpro.no> wrote:

> Hi.
>
> I'm indexing an intranet and I see some pages are fetched twenty times.
> There are a lot of anchors used so there are a lot of links like the
> ones in the subject.
>
> Is there some way I can instruct the crawler to discard the part of the
> url which is after the hash sign? I'm using nutch from trunk a few
> months back in time.
>
> TIA,
>
>
> Per.
>