Posted to user@nutch.apache.org by Detlef Müller-Solger <d....@durato.eu> on 2008/10/08 13:23:10 UTC

Doublets

Hi,

in Germany it is reported that one big show stopper for Nutch is the
fact that there are often identical webpages which can be addressed by
different URLs, for example by requesting

www.xyz.de/information
or by
www.xyz.de/information/
or by
www.xyz.de/information/index

From my point of view, due to the different URLs Nutch unfortunately
indexes those webpages three times. Is there a method to avoid indexing
these doublets? For example, by comparing all information of the webpage
excluding the URL.

Note: a filter that generally strips "/index" from URLs is no solution,
because in other cases in the same run "/index" may be needed, or the
same webpage can also be addressed by yet another URL syntax.

Thanx

Detlef Müller-Solger


Re: Doublets

Posted by Julien Nioche <li...@gmail.com>.
Hi,

I haven't used it myself, but it looks like the *dedup* command
(http://wiki.apache.org/nutch/bin/nutch_dedup) uses the signature of the
documents to remove duplicates. That should work fine in the case you
are describing, in combination with Jasper's suggestion, which would
prevent fetching some of the duplicates in the first place.
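
The signature implementation that dedup relies on is pluggable via the
db.signature.class property. As a minimal sketch (assuming a Nutch
0.9-era install; check the property and class names against your
version), switching to TextProfileSignature in conf/nutch-site.xml makes
the signature depend on the extracted page text rather than the raw
bytes, so your three URL variants should produce identical signatures:

<!-- sketch for conf/nutch-site.xml; names as in Nutch 0.9/1.x -->
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>Signature class used for duplicate detection.
  TextProfileSignature hashes the parsed text instead of the raw
  content, so identical pages fetched under different URLs should
  collide.</description>
</property>

After indexing, a run of the dedup command from the wiki page above
(e.g. bin/nutch dedup crawl/indexes) would then delete the extra copies
from the index.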

Julien

-- 
DigitalPebble Ltd
http://www.digitalpebble.com



Re: Doublets

Posted by Jasper Kamperman <ja...@openwaternet.com>.
There is a URL-normalizing feature in

	conf/regex-normalize.xml

For example, I used the following patterns to say that pages with
&limit and &limitstart are the same on a JasperSoft forum:

<!-- Jasper: normalize JasperSoft URLs:
     (1) &limit=6&limitstart=0 means the same as the page w/o any limit
     (2) catid=10&id=NNN means the same as id=NNN&catid=10 -->
<regex-normalize>
  <regex>
    <pattern>(\?|\&amp;|\&amp;amp;)limit=6(\&amp;|\&amp;amp;)limitstart=0$</pattern>
    <substitution></substitution>
  </regex>
  <regex>
    <pattern>(\?|\&amp;|\&amp;amp;)(id=[0-9]+)(\&amp;|\&amp;amp;)(catid=10)(.*)</pattern>
    <substitution>$1$4$3$2$5</substitution>
  </regex>
</regex-normalize>
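
For the trailing-slash and /index variants from the original question, a
site-specific rule in the same file could fold all three URLs onto one
canonical form. This is only a sketch built around the example host
www.xyz.de; the real pattern would have to be adapted and tested for
your site:

<!-- hypothetical rule for the www.xyz.de example: rewrite both
     .../information/ and .../information/index to .../information -->
<regex>
  <pattern>^(http://www\.xyz\.de/information)/(index)?$</pattern>
  <substitution>$1</substitution>
</regex>

Because normalization is applied to URLs before they are fetched, the
crawler then only ever sees the canonical form, so the duplicates are
never fetched in the first place.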

