Posted to user@nutch.apache.org by Dennis Kubes <ku...@apache.org> on 2008/07/23 16:54:36 UTC

Re: Dedup Question

It will remove the one with the lowest score in the crawldb, as set by
the scoring filters.  Dedup first removes duplicates by URL, then by
content hash.  If the content is changed even slightly, though, the hash
no longer matches and the page will *not* be detected as a duplicate.
Solving that problem is called near-duplicate detection (ndd); it uses an
algorithm called shingling, which isn't currently implemented in Nutch
(but hopefully will be in the near future).
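
To make the two ideas above concrete, here is a rough sketch in Python (not Nutch code). It assumes each document is a dict with hypothetical `url`, `content` and `score` keys, uses MD5 as a stand-in for whatever content signature is configured, and keeps the higher score in the URL pass as well, which is an assumption. The `shingles` and `jaccard` helpers are only an illustration of word shingling, not an implementation taken from Nutch.

```python
import hashlib


def content_hash(text):
    """Exact content signature (MD5 here is an assumption, used as a stand-in)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def keep_best(docs, key):
    """Keep one document per key value: the one with the highest score."""
    best = {}
    for doc in docs:
        k = key(doc)
        if k not in best or doc["score"] > best[k]["score"]:
            best[k] = doc
    return list(best.values())


def dedup(docs):
    """Remove duplicates by URL first, then by exact content hash."""
    by_url = keep_best(docs, key=lambda d: d["url"])
    return keep_best(by_url, key=lambda d: content_hash(d["content"]))


def shingles(text, k=3):
    """Set of overlapping k-word sequences (shingles) in the text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the two texts' shingle sets (1.0 = identical sets)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    docs = [
        {"url": "http://www.example.com/index.html", "content": "EMPTY FILE", "score": 1.0},
        {"url": "http://www.domain.com/index.html", "content": "EMPTY FILE", "score": 0.5},
        # One extra character: a different hash, so exact dedup keeps it.
        {"url": "http://www.domain.com/other.html", "content": "EMPTY FILE.", "score": 0.2},
    ]
    for doc in dedup(docs):
        print(doc["url"])

    a = "the quick brown fox jumps over the lazy dog"
    b = "the quick brown fox jumps over the lazy cat"
    print(content_hash(a) == content_hash(b))  # False: exact hashing misses it
    print(jaccard(a, b))                       # 0.75: shingling sees the overlap
```

With these made-up scores, the lower-scored www.domain.com/index.html copy is dropped, while the slightly different page survives exact hashing. A near-duplicate detector would compare shingle similarity against a threshold instead of comparing hashes.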

Dennis

Patrick Markiewicz wrote:
> Hi,
>
> If I have a url http://www.example.com/index.html stored in
> my index with the content: EMPTY FILE, and I have a file
> http://www.domain.com/index.html with the content: EMPTY FILE, then the
> two files are duplicates.  Which one will the de-duplication process
> remove from the index?  Thanks.
>
> Patrick

url index

Posted by Marcel T <md...@hotmail.com>.

I'm trying to use a URL to get all the indexed information about that URL, but I couldn't find any API for it. Any hints? Thanks!

RE: Dedup Question

Posted by Devang Shah <de...@gmail.com>.

>> Actually a custom plugin wouldn't work in this instance because it
>> wouldn't affect the document boost score.

Is that true? Doesn't the indexerScore method in scoring-opic affect the
index boost score?  And don't methods like updateDbScore update the datum
score?

-D.

-----Original Message-----
From: Dennis Kubes [mailto:kubes@apache.org] 
Sent: Wednesday, July 23, 2008 11:36 AM
To: nutch-user@lucene.apache.org
Subject: Re: Dedup Question

Actually a custom plugin wouldn't work in this instance because it 
wouldn't affect the document boost score.  You would need to operate on 
the crawldb directly or have a different indexer.  I will send you a 
hacked out ArbitraryIndexer that uses RPN to arbitrarily boost scores.

That being said I have completed work on a new scoring and indexing 
framework which stabilizes link scores and makes indexing much more 
flexible.  That should be released very soon.

Dennis

Re: Dedup Question

Posted by Dennis Kubes <ku...@apache.org>.

Actually a custom plugin wouldn't work in this instance because it 
wouldn't affect the document boost score.  You would need to operate on 
the crawldb directly or have a different indexer.  I will send you a 
hacked out ArbitraryIndexer that uses RPN to arbitrarily boost scores.
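
The ArbitraryIndexer itself is not shown in this thread, so the following is only a guess at the general idea, assuming RPN here means reverse Polish notation: a boost rule is written as a postfix expression and evaluated against a document's score. Everything in this Python sketch (the `BOOST_RULES` table, the `score` variable name, the function names) is hypothetical and not part of Nutch.

```python
import operator
from urllib.parse import urlsplit

# Binary operators supported by this toy evaluator.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expr, variables):
    """Evaluate a space-separated postfix (RPN) expression.

    Tokens are operators, variable names found in `variables`, or numbers.
    """
    stack = []
    for token in expr.split():
        if token in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        elif token in variables:
            stack.append(float(variables[token]))
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression: %r" % expr)
    return stack[0]


# Hypothetical per-host boost rules: double the score for one domain.
BOOST_RULES = {
    "www.example.com": "score 2 *",
}


def boosted_score(url, score):
    """Apply the host's RPN rule to the score, or return it unchanged."""
    rule = BOOST_RULES.get(urlsplit(url).hostname)
    if rule is None:
        return score
    return eval_rpn(rule, {"score": score})


if __name__ == "__main__":
    print(boosted_score("http://www.example.com/index.html", 0.5))  # 1.0
    print(boosted_score("http://www.domain.com/index.html", 0.5))   # 0.5
```

In a real setup the boosted value would have to be written back to the crawldb or applied at indexing time, as described above; this sketch only shows the expression evaluation.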

That being said, I have completed work on a new scoring and indexing
framework that stabilizes link scores and makes indexing much more
flexible.  That should be released very soon.

Dennis

Patrick Markiewicz wrote:
> Is there a way to configure nutch's scoring-opic plugin to bump up the
> score of a particular domain?  Or does it require a custom scoring
> plugin to do so?  Thanks.
> 
> Patrick

RE: Dedup Question

Posted by Patrick Markiewicz <pm...@sim-gtech.com>.

Is there a way to configure nutch's scoring-opic plugin to bump up the
score of a particular domain?  Or does it require a custom scoring
plugin to do so?  Thanks.

Patrick

-----Original Message-----
From: Dennis Kubes [mailto:kubes@apache.org] 
Sent: Wednesday, July 23, 2008 10:55 AM
To: nutch-user@lucene.apache.org
Subject: Re: Dedup Question

It will remove the one with the lowest score in the crawldb as set by
the scoring filters.  Dedup first removes by url then by content hash.
If the content is changed even slightly though it will *not* be detected
as a duplicate.  Solving that problem is called near duplicate detection
(ndd) and uses an algorithm called shingling which isn't currently
implemented in Nutch (but hopefully will be in the near future).

Dennis
