Posted to user@nutch.apache.org by Edward Quick <ed...@hotmail.com> on 2008/09/25 13:29:30 UTC
pages with duplicate content in search results
Hi,
Even though I ran nutch dedup on my index, I still have pages with different URLs but exactly the same content (see the search result example below). From what I've read about dedup, this shouldn't happen, as it deletes the URL with the lowest score. Is there anything else I can try to get rid of these?
Thanks,
Ed.
Item Document :- Client - TeraTerm Pro
... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards Online Employee Self Service ESS Home ... Description Document Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where printing or keymapping is an issue, TeraTerm ...
http://www.somedomain.com/im/tech/technica.nsf/8918e269a19be23f802563ef004e8e7a/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)
Item Document :- Client - TeraTerm Pro
... Item Document :- Client - TeraTerm Pro Intranet - Technical Standards Online Employee Self Service ESS Home ... Description Document Technology Category: Client Name of item: TeraTerm Pro Related policy: Unix Access Tool Vendor: Current Technical Status ... standard Telnet tool. Where printing or keymapping is an issue, TeraTerm ...
http://www.somedomain.com/im/tech/technica.nsf/dacff06c3e1dbc9780257273004e1e3b/441cdf92bbe06a9e80256c87003d81d9?OpenDocument (cached) (explain) (anchors)
RE: pages with duplicate content in search results
Posted by Edward Quick <ed...@hotmail.com>.
> > On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <ku...@apache.org> wrote:
> >
> > > If you are using more than one index then dedup will not work across
> > > indexes. A single index should dedup correctly unless the pages are not
> > > exact duplicates but near duplicates. The dedup process works on url and
> > > byte hash. If the content is even 1 byte different, it doesn't work.
>
>
> I only have one index, and have only crawled one site, which is the intranet at my work.
> The pages definitely seem to be identical: I saved the source of both pages and the sizes were exactly the same too.
Also, just to add to this: I checked the index with Luke, which shows the two URLs with the same titles but different timestamps, digests and boosts. :-(
RE: pages with duplicate content in search results
Posted by Edward Quick <ed...@hotmail.com>.
> Date: Thu, 25 Sep 2008 21:10:52 +0530
> From: vishal.ce@gmail.com
> To: nutch-user@lucene.apache.org
> Subject: Re: pages with duplicate content in search results
>
> Dennis,
> I am facing the same problem: in my crawl, the content of some URLs is
> the same but the URLs are different. Could you please tell me how I can
> set hitsPerSite to 1?
I changed hitsPerSite to 0 in search.jsp (to get rid of the 'show all hits' button). It might be possible to set this in web.xml or nutch-site.xml though?
>
> --Vishal
>
> On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <ku...@apache.org> wrote:
>
> > If you are using more than one index then dedup will not work across
> > indexes. A single index should dedup correctly unless the pages are not
> > exact duplicates but near duplicates. The dedup process works on url and
> > byte hash. If the content is even 1 byte different, it doesn't work.
I only have one index, and have only crawled one site, which is the intranet at my work.
The pages definitely seem to be identical: I saved the source of both pages and the sizes were exactly the same too.
Re: pages with duplicate content in search results
Posted by vishal vachhani <vi...@gmail.com>.
Thank you very much!
On Thu, Sep 25, 2008 at 9:26 PM, Dennis Kubes <ku...@apache.org> wrote:
> In search.jsp lines 116-119:
>
> int hitsPerSite = 2; // max hits per site
> String hitsPerSiteString = request.getParameter("hitsPerSite");
> if (hitsPerSiteString != null)
>   hitsPerSite = Integer.parseInt(hitsPerSiteString);
>
> Hope that helps.
>
> Dennis
--
Thanks and Regards,
Vishal Vachhani
M.tech, CSE dept
Indian Institute of Technology, Bombay
http://www.cse.iitb.ac.in/~vishalv
Re: pages with duplicate content in search results
Posted by Dennis Kubes <ku...@apache.org>.
In search.jsp lines 116-119:
int hitsPerSite = 2; // max hits per site
String hitsPerSiteString = request.getParameter("hitsPerSite");
if (hitsPerSiteString != null)
  hitsPerSite = Integer.parseInt(hitsPerSiteString);
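To default to one hit per site, the initializer above can simply be changed from 2 to 1. A standalone sketch of the same parsing logic, hardened against malformed parameter values (the class and method names here are illustrative, not part of Nutch):

```java
public class HitsPerSite {
    // Same parsing as the search.jsp snippet, but defaulting to 1
    // (collapse each site to a single hit) and keeping the default
    // when the request parameter is not a valid integer.
    static int parseHitsPerSite(String param) {
        int hitsPerSite = 1; // default: one hit per site
        if (param != null) {
            try {
                hitsPerSite = Integer.parseInt(param);
            } catch (NumberFormatException e) {
                // ignore bad input such as "abc"; keep the default
            }
        }
        return hitsPerSite;
    }

    public static void main(String[] args) {
        System.out.println(parseHitsPerSite(null));  // 1
        System.out.println(parseHitsPerSite("2"));   // 2
        System.out.println(parseHitsPerSite("abc")); // 1
    }
}
```

With a default of 1, each site contributes at most one result unless the request explicitly overrides it via the hitsPerSite parameter.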
Hope that helps.
Dennis
vishal vachhani wrote:
> Dennis,
> I am facing the same problem: in my crawl, the content of some URLs is
> the same but the URLs are different. Could you please tell me how I can
> set hitsPerSite to 1?
>
> --Vishal
Re: pages with duplicate content in search results
Posted by vishal vachhani <vi...@gmail.com>.
Dennis,
I am facing the same problem: in my crawl, the content of some URLs is
the same but the URLs are different. Could you please tell me how I can
set hitsPerSite to 1?
--Vishal
On Thu, Sep 25, 2008 at 6:12 PM, Dennis Kubes <ku...@apache.org> wrote:
> If you are using more than one index then dedup will not work across
> indexes. A single index should dedup correctly unless the pages are not
> exact duplicates but near duplicates. The dedup process works on url and
> byte hash. If the content is even 1 byte different, it doesn't work.
>
> Near duplicate detection is another set of algorithms that haven't been
> implemented in Nutch yet. On the query site you can set the hitsPerSite to
> 1 and it should limit your search results.
>
> Dennis
Re: pages with duplicate content in search results
Posted by Andrzej Bialecki <ab...@getopt.org>.
Edward Quick wrote:
> Thanks Andrzej.
> How do you tell Nutch to use the TextProfileSignature instead of MD5HashSignature for deduplicating?
See the following property in your nutch-site.xml: db.signature.class.
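For reference, a nutch-site.xml entry along these lines should work (the fully qualified class name below is an assumption based on Nutch's org.apache.nutch.crawl package, where the signature implementations live; check your version's nutch-default.xml for the exact name):

```xml
<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>The Signature implementation used to compute page
  signatures for deduplication. TextProfileSignature is tolerant of
  small differences in the page text.</description>
</property>
```

Note that signatures are computed when pages are fetched and parsed, so pages already in the crawldb keep their old signatures until they are refetched.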
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: pages with duplicate content in search results
Posted by Edward Quick <ed...@hotmail.com>.
> > Near duplicate detection is another set of algorithms that haven't been
> > implemented in Nutch yet.
>
> Well, the existing TextProfileSignature can be used as a form of (crude)
> near-duplicate detection, precisely because it is tolerant to small
> changes in the input text.
Thanks Andrzej.
How do you tell Nutch to use the TextProfileSignature instead of MD5HashSignature for deduplicating?
Re: pages with duplicate content in search results
Posted by Andrzej Bialecki <ab...@getopt.org>.
Dennis Kubes wrote:
> If you are using more than one index then dedup will not work across
> indexes.
This is incorrect. DeleteDuplicates works just fine with multiple
indexes, assuming you process all indexes in the same run of
DeleteDuplicates, so that it has a global view of all input indexes.
> A single index should dedup correctly unless the pages are not
> exact duplicates but near duplicates. The dedup process works on url
> and byte hash. If the content is even 1 byte different, it doesn't work.
This depends on the implementation of Signature. Indeed, the default
MD5HashSignature works this way.
>
> Near duplicate detection is another set of algorithms that haven't been
> implemented in Nutch yet.
Well, the existing TextProfileSignature can be used as a form of (crude)
near-duplicate detection, precisely because it is tolerant to small
changes in the input text.
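To illustrate why a profile over the text tolerates small changes where a byte hash does not, here is a crude standalone sketch. This is a simplification for illustration only, not Nutch's actual TextProfileSignature (which, among other things, also quantizes term frequencies):

```java
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.TreeSet;

public class TextProfileDemo {
    // Crude text-profile signature: lower-case, tokenize on non-word
    // characters, sort the distinct terms, and hash the resulting
    // profile. Punctuation and word-order changes no longer alter it.
    static byte[] profileSignature(String text) throws Exception {
        TreeSet<String> terms = new TreeSet<>(
                Arrays.asList(text.toLowerCase().split("\\W+")));
        return MessageDigest.getInstance("MD5")
                .digest(String.join(" ", terms).getBytes("UTF-8"));
    }

    public static void main(String[] args) throws Exception {
        String a = "TeraTerm Pro is a standard Telnet tool.";
        String b = "TeraTerm Pro is a standard Telnet tool!"; // one byte changed
        System.out.println(Arrays.equals(
                profileSignature(a), profileSignature(b)));   // true
        System.out.println(Arrays.equals(
                profileSignature(a),
                profileSignature("A completely different page"))); // false
    }
}
```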
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: pages with duplicate content in search results
Posted by David Jashi <da...@jashi.ge>.
Sorry for the off-topic question, but how do you make Nutch 0.9 search multiple indexes?
--
with best regards,
David Jashi
Web development EO,
Caucasus Online
+995(32)970368
David@Jashi.ge
Re: pages with duplicate content in search results
Posted by Dennis Kubes <ku...@apache.org>.
If you are using more than one index then dedup will not work across
indexes. A single index should dedup correctly unless the pages are not
exact duplicates but near duplicates. The dedup process works on url
and byte hash. If the content is even 1 byte different, it doesn't work.
Near duplicate detection is another set of algorithms that haven't been
implemented in Nutch yet. On the query site you can set the hitsPerSite
to 1 and it should limit your search results.
Dennis
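The byte-for-byte sensitivity described above can be seen with a plain MD5 digest. MessageDigest is standard JDK; this only illustrates the principle, not Nutch's actual Signature code:

```java
import java.security.MessageDigest;
import java.util.Arrays;

public class ByteHashDemo {
    // An MD5 digest over the raw page bytes, as a byte-level signature
    // uses: flipping a single byte yields a completely different digest.
    static byte[] md5(byte[] content) throws Exception {
        return MessageDigest.getInstance("MD5").digest(content);
    }

    public static void main(String[] args) throws Exception {
        byte[] pageA = "Item Document :- Client - TeraTerm Pro".getBytes("UTF-8");
        byte[] pageB = pageA.clone();
        pageB[pageB.length - 1] ^= 1; // change exactly one byte

        System.out.println(Arrays.equals(md5(pageA), md5(pageA))); // true: identical bytes
        System.out.println(Arrays.equals(md5(pageA), md5(pageB))); // false: one-byte difference
    }
}
```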