Posted to user@nutch.apache.org by Nuther <nu...@proservice.ge> on 2007/07/06 09:06:40 UTC

Re[2]: site alias

Hi, Susam.

But that's wrong: your solution is just the easiest way to get rid of
the duplicates. If you know the DataParkSearch engine, it has this
alias option. So, is using a URL filter the only way to avoid
duplicates? Or is there a way to code this feature, and if so, how?

> I have faced this issue. I block the duplicate domain using the URL
> filters. So only one domain is crawled by the bot and the other domain
> is ignored.

> Regards,
> Susam Pal
> http://susam.in/

> On 7/6/07, Nuther <nu...@proservice.ge> wrote:
>> Hi,
>> I was wondering if Nutch has an alias option.
>> Let's say we have two domains, www.site1.com and www.site2.com, that point to
>> one site. How can I tell Nutch that they point to the same site? This is a
>> problem because there are a lot of duplicates in the search results.
>> Thanks.

>> --
>> Regards,
>>  Nuther                          mailto:nuther@proservice.ge



-- 
Regards,
 Nuther                          mailto:nuther@proservice.ge

Re: Re[2]: site alias

Posted by Susam Pal <su...@gmail.com>.
Hi Nuther,

I am not sure whether this is the only way to solve this problem, but
it works very well for me on an intranet.

Which one of the following two do you want to achieve by coding?

1. Block one domain name completely.
2. Allow both domain names but remember that both point to the same
resource, so that when a page has been fetched from one domain, a note
is kept of it and the same page is not requested from the other domain.

I don't think coding for point 1 is a good idea, because that can
already be achieved through URL filters.
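
For example, a single rule near the top of conf/regex-urlfilter.txt
blocks the alias domain (just a sketch, assuming the hostnames from
your original message; adjust the pattern to your actual domains):

    # Skip everything served from the alias domain; the same pages
    # are still crawled through www.site1.com.
    -^http://www\.site2\.com/

    # accept anything else (already the last line of the default file)
    +.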

For point 2, a good starting point would be
src/java/org/apache/nutch/crawl/Generator.java and
src/java/org/apache/nutch/fetcher/Fetcher.java.
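
The idea in both places would be to recognise that a URL on one host
has already been generated or fetched on the other. A lighter-weight
variation on point 2 is to rewrite every URL on the alias host to the
canonical host before it enters the CrawlDb, so the duplicate is never
scheduled at all. The core of such a rewrite would look roughly like
this (only a sketch: the class name and hostnames are made up, and you
would still have to plug it into the URL-normalizing machinery of your
Nutch version):

    import java.net.MalformedURLException;
    import java.net.URL;

    /** Rewrites URLs on an alias host to the canonical host. */
    public class AliasNormalizer {

      private static final String ALIAS_HOST = "www.site2.com";
      private static final String CANONICAL_HOST = "www.site1.com";

      public String normalize(String urlString) throws MalformedURLException {
        URL url = new URL(urlString);
        if (ALIAS_HOST.equalsIgnoreCase(url.getHost())) {
          // Keep protocol, port, path and query; swap only the host.
          return new URL(url.getProtocol(), CANONICAL_HOST,
              url.getPort(), url.getFile()).toString();
        }
        return urlString; // everything else passes through unchanged
      }
    }

The advantage over filtering is that links discovered on the alias
host still contribute to the crawl; they are simply credited to the
canonical name.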

Regards,
Susam Pal
http://susam.in/

On 7/6/07, Nuther <nu...@proservice.ge> wrote:
> Hi, Susam.
>
> But that's wrong: your solution is just the easiest way to get rid of
> the duplicates. If you know the DataParkSearch engine, it has this
> alias option. So, is using a URL filter the only way to avoid
> duplicates? Or is there a way to code this feature, and if so, how?
>
> > I have faced this issue. I block the duplicate domain using the URL
> > filters. So only one domain is crawled by the bot and the other domain
> > is ignored.
>
> > Regards,
> > Susam Pal
> > http://susam.in/
>
> > On 7/6/07, Nuther <nu...@proservice.ge> wrote:
> >> Hi,
> >> I was wondering if Nutch has an alias option.
> >> Let's say we have two domains, www.site1.com and www.site2.com, that point to
> >> one site. How can I tell Nutch that they point to the same site? This is a
> >> problem because there are a lot of duplicates in the search results.
> >> Thanks.
>
> >> --
> >> Regards,
> >>  Nuther                          mailto:nuther@proservice.ge