Posted to user@nutch.apache.org by "Nemani, Raj" <Ra...@turner.com> on 2010/09/23 22:11:44 UTC

Duplicate URLs

All,

 

I just wanted to see if there is a way we can tell Nutch to treat the
following URLs as the same.

 

 

http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm

 

http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm

 

 

As you know, you can set up web servers such that both of the URLs above
resolve to the same end point.  In other words, the two URLs are actually
*the same* even though they are physically different.  Is there any way I can
tell Nutch to treat these URLs as the same?

I cannot use filtering to ignore one or the other (either with
DOMAINNAME or without) because I need to allow both patterns in order to
admit genuine URLs.

 

Thanks

Raj

 

 


RE: Duplicate URLs

Posted by "Nemani, Raj" <Ra...@turner.com>.
Thank you so much!  Based on the conversation you are having in another
thread that deals with OutOfMemory exceptions during SolrDedup, I may
have to investigate deduping on the Solr side.  My index is 3.2 million
documents and is constantly growing at a considerable rate.

Thanks again
Raj


-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@buyways.nl] 
Sent: Friday, September 24, 2010 7:29 AM
To: user@nutch.apache.org
Subject: Re: Duplicate URLs


On Friday 24 September 2010 00:33:54 Nemani, Raj wrote:
> My solr index has sources other than the data generated from Nutch crawls. 
>  What this means is that when I do solrDedup from Nutch, the dedup process
>  will happen across the entire solr Index, not just on the documents
>  generated and submitted by Nutch, Am I correct?

Correct.

> 
> Is there a way I can have the deduping done on the Nutch side before
>  sending the data set to Solr even if it means I need to generate the Nutch
>  index.  Just to reiterate my dupes are based on the content, not on the
>  URL.

I'm not sure. You'll need a Nutch index to deduplicate first. But it's the 
index that will be deduplicated, not the parsed segments. Sending stuff to 
Solr then would not be very helpful.

> 
> On the other hand it looks like you have to supply the Nutch index
>  directory to Nutch dedup command, not the segments directory.  Here are
>  the Hadoop log entries. Could the documentation be wrong?  Note that I
>  have not generated the Nutch index.  After merging the segements and
>  inverting the links, I just called the Dedup on my segments directory.  It
>  did not seem to do anything.  Do I have to build the Nutch Index and then
>  call the dedup on the segments directory?

Nutch dedup command required a parameter pointing to an index, you'll need an 
index in Nutch to dedup.

> 
> 2010-09-23 17:42:39,673 INFO  indexer.DeleteDuplicates - Dedup: starting at
>  2010-09-23 17:42:39 2010-09-23 17:42:39,698 INFO  indexer.DeleteDuplicates
>  - Dedup: adding indexes in: crawl/segments 2010-09-23 17:42:40,792 WARN 
>  mapred.FileInputFormat - Can't open index at
>  file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:0+21
> 47483647, skipping. (no segments* file found in
>  org.apache.nutch.indexer.FsDirectory@file:/C:/projects/OpenSource/branch-1
> .2/crawl/segments/20100923174134: files: [content, crawl_fetch,
>  crawl_generate, crawl_parse, parse_data, parse_text]) 2010-09-23
>  17:42:45,200 INFO  indexer.DeleteDuplicates - Dedup: finished at
>  2010-09-23 17:42:45, elapsed: 00:00:05

That's the segments* doing there?  It shouldn't.

> 
> Thanks for all your help
> Raj
> 
> 
> 
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Duplicate URLs

Posted by "Nemani, Raj" <Ra...@turner.com>.
Hi Markus,

 

As you can see, I have used the "digest" field as the source field for
the processor and then stored the signature in a newly added field in the
schema called "sig".  I am still testing my index to make sure that what I
did does not create any unintended results.  OTOH, is there a way I can tell
the processor to just *use* the "digest" field for the dedupe process,
without me having to create a new "sig" field to store the digest of the
"digest" field?

 

Thanks for your continued help

Raj

 

 

________________________________

From: Markus Jelsma [mailto:markus.jelsma@buyways.nl] 
Sent: Sunday, September 26, 2010 1:31 PM
To: user@nutch.apache.org; Nemani, Raj
Subject: RE: Duplicate URLs

 

Nutch has a fuzzy hashing algorithm for generating digests for a
document. Solr incorporates the TextProfileSignature that comes from
Nutch. I'm not sure if the digest field is generated by this algorithm;
if it is, it makes sense to use that for deduplication. If the digest
field is generated by an exact hashing algorithm such as MD5, it won't
allow you to use the TextProfileSignature algorithm in Solr for fuzzy
matching.
 



RE: Duplicate URLs

Posted by Markus Jelsma <ma...@buyways.nl>.
Nutch has a fuzzy hashing algorithm for generating digests for a document. Solr incorporates the TextProfileSignature that comes from Nutch. I'm not sure if the digest field is generated by this algorithm; if it is, it makes sense to use that for deduplication. If the digest field is generated by an exact hashing algorithm such as MD5, it won't allow you to use the TextProfileSignature algorithm in Solr for fuzzy matching.
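
If the digest turns out to be an exact hash and you want the fuzzy behaviour, one
possible sketch is to let Solr compute its own signature from the page text instead
of reusing Nutch's digest; the field names here ("content", "sig") are assumptions
and would need to match your schema:

<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
  <bool name="enabled">true</bool>
  <str name="signatureField">sig</str>
  <bool name="overwriteDupes">true</bool>
  <str name="signatureClass">org.apache.solr.update.processor.TextProfileSignature</str>
  <str name="fields">content</str>
</processor>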
 
-----Original message-----
From: Nemani, Raj <Ra...@turner.com>
Sent: Fri 24-09-2010 23:18
To: user@nutch.apache.org; Markus Jelsma <ma...@buyways.nl>; 
Subject: RE: Duplicate URLs

So I used Solr-side deduping in the end, by configuring Solr for deduping
in SolrConfig.xml.  Here is what I ended up doing.  I noticed that the
digest field generated by Nutch for the two URLs I mentioned is the same.
So I used that as the source field and created a new signature field in
schema.xml.  Here are my config changes from SolrConfig.xml.  It does feel
weird to use the digest field for this purpose.  Does this make sense?

SolrConfig.xml
---------------------------



<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" >
   <lst name="defaults">
     <str name="update.processor">dedupe</str>
   </lst>
 </requestHandler>

Schema.xml
--------------------

<field name="sig" type="string" stored="true" indexed="true"
multiValued="true" />



RE: Duplicate URLs

Posted by "Nemani, Raj" <Ra...@turner.com>.
So I used Solr-side deduping in the end, by configuring Solr for deduping
in SolrConfig.xml.  Here is what I ended up doing.  I noticed that the
digest field generated by Nutch for the two URLs I mentioned is the same.
So I used that as the source field and created a new signature field in
schema.xml.  Here are my config changes from SolrConfig.xml.  It does feel
weird to use the digest field for this purpose.  Does this make sense?

SolrConfig.xml
---------------------------



<updateRequestProcessorChain name="dedupe">
  <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">sig</str>
    <bool name="overwriteDupes">true</bool>
    <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    <str name="fields">digest</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>


<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" >
    <lst name="defaults">
      <str name="update.processor">dedupe</str>
    </lst>
  </requestHandler>

Schema.xml
--------------------

<field name="sig" type="string" stored="true" indexed="true"
multiValued="true" />
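
One way to sanity-check the result (a sketch; the Solr host, port and core path are
assumptions) is to ask Solr whether any signature value still occurs on more than one
document, which should come back empty while overwriteDupes is true:

http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.field=sig&facet.mincount=2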

-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@buyways.nl] 
Sent: Friday, September 24, 2010 7:29 AM
To: user@nutch.apache.org
Subject: Re: Duplicate URLs


On Friday 24 September 2010 00:33:54 Nemani, Raj wrote:
> My solr index has sources other than the data generated from Nutch crawls. 
>  What this means is that when I do solrDedup from Nutch, the dedup process
>  will happen across the entire solr Index, not just on the documents
>  generated and submitted by Nutch, Am I correct?

Correct.

> 
> Is there a way I can have the deduping done on the Nutch side before
>  sending the data set to Solr even if it means I need to generate the Nutch
>  index.  Just to reiterate my dupes are based on the content, not on the
>  URL.

I'm not sure. You'll need a Nutch index to deduplicate first. But it's the 
index that will be deduplicated, not the parsed segments. Sending stuff to 
Solr then would not be very helpful.

> 
> On the other hand it looks like you have to supply the Nutch index
>  directory to Nutch dedup command, not the segments directory.  Here are
>  the Hadoop log entries. Could the documentation be wrong?  Note that I
>  have not generated the Nutch index.  After merging the segements and
>  inverting the links, I just called the Dedup on my segments directory.  It
>  did not seem to do anything.  Do I have to build the Nutch Index and then
>  call the dedup on the segments directory?

Nutch dedup command required a parameter pointing to an index, you'll need an 
index in Nutch to dedup.

> 
> 2010-09-23 17:42:39,673 INFO  indexer.DeleteDuplicates - Dedup: starting at
>  2010-09-23 17:42:39 2010-09-23 17:42:39,698 INFO  indexer.DeleteDuplicates
>  - Dedup: adding indexes in: crawl/segments 2010-09-23 17:42:40,792 WARN 
>  mapred.FileInputFormat - Can't open index at
>  file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:0+21
> 47483647, skipping. (no segments* file found in
>  org.apache.nutch.indexer.FsDirectory@file:/C:/projects/OpenSource/branch-1
> .2/crawl/segments/20100923174134: files: [content, crawl_fetch,
>  crawl_generate, crawl_parse, parse_data, parse_text]) 2010-09-23
>  17:42:45,200 INFO  indexer.DeleteDuplicates - Dedup: finished at
>  2010-09-23 17:42:45, elapsed: 00:00:05

That's the segments* doing there?  It shouldn't.

> 
> Thanks for all your help
> Raj
> 
> 
> 
Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


Re: Duplicate URLs

Posted by Markus Jelsma <ma...@buyways.nl>.
On Friday 24 September 2010 00:33:54 Nemani, Raj wrote:
> My solr index has sources other than the data generated from Nutch crawls. 
>  What this means is that when I do solrDedup from Nutch, the dedup process
>  will happen across the entire solr Index, not just on the documents
>  generated and submitted by Nutch, Am I correct?

Correct.

> 
> Is there a way I can have the deduping done on the Nutch side before
>  sending the data set to Solr even if it means I need to generate the Nutch
>  index.  Just to reiterate my dupes are based on the content, not on the
>  URL.

I'm not sure. You'll need a Nutch index to deduplicate first. But it's the 
index that will be deduplicated, not the parsed segments. Sending stuff to 
Solr then would not be very helpful.

> 
> On the other hand it looks like you have to supply the Nutch index
>  directory to Nutch dedup command, not the segments directory.  Here are
>  the Hadoop log entries. Could the documentation be wrong?  Note that I
>  have not generated the Nutch index.  After merging the segements and
>  inverting the links, I just called the Dedup on my segments directory.  It
>  did not seem to do anything.  Do I have to build the Nutch Index and then
>  call the dedup on the segments directory?

The Nutch dedup command requires a parameter pointing to an index; you'll need an
index in Nutch to dedup.
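
For the classic Lucene-based workflow that would look roughly like the following;
the paths are only examples and depend on your crawl layout:

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
bin/nutch dedup crawl/indexes

and only after that push to Solr with bin/nutch solrindex.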

> 
> 2010-09-23 17:42:39,673 INFO  indexer.DeleteDuplicates - Dedup: starting at
>  2010-09-23 17:42:39 2010-09-23 17:42:39,698 INFO  indexer.DeleteDuplicates
>  - Dedup: adding indexes in: crawl/segments 2010-09-23 17:42:40,792 WARN 
>  mapred.FileInputFormat - Can't open index at
>  file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:0+21
> 47483647, skipping. (no segments* file found in
>  org.apache.nutch.indexer.FsDirectory@file:/C:/projects/OpenSource/branch-1
> .2/crawl/segments/20100923174134: files: [content, crawl_fetch,
>  crawl_generate, crawl_parse, parse_data, parse_text]) 2010-09-23
>  17:42:45,200 INFO  indexer.DeleteDuplicates - Dedup: finished at
>  2010-09-23 17:42:45, elapsed: 00:00:05

What's the segments* doing there?  It shouldn't be there.

> 
> Thanks for all your help
> Raj
> 
> 
> 

Markus Jelsma - Technisch Architect - Buyways BV
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350


RE: Duplicate URLs

Posted by "Nemani, Raj" <Ra...@turner.com>.
My solr index has sources other than the data generated from Nutch crawls.  What this means is that when I do solrDedup from Nutch, the dedup process will happen across the entire Solr index, not just on the documents generated and submitted by Nutch.  Am I correct?

Is there a way I can have the deduping done on the Nutch side before sending the data set to Solr, even if it means I need to generate the Nutch index?  Just to reiterate, my dupes are based on the content, not on the URL.

On the other hand, it looks like you have to supply the Nutch index directory to the Nutch dedup command, not the segments directory.  Here are the Hadoop log entries.  Could the documentation be wrong?  Note that I have not generated the Nutch index.  After merging the segments and inverting the links, I just called dedup on my segments directory.  It did not seem to do anything.  Do I have to build the Nutch index and then call dedup on the segments directory?

2010-09-23 17:42:39,673 INFO  indexer.DeleteDuplicates - Dedup: starting at 2010-09-23 17:42:39
2010-09-23 17:42:39,698 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/segments
2010-09-23 17:42:40,792 WARN  mapred.FileInputFormat - Can't open index at file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134:0+2147483647, skipping. (no segments* file found in org.apache.nutch.indexer.FsDirectory@file:/C:/projects/OpenSource/branch-1.2/crawl/segments/20100923174134: files: [content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text])
2010-09-23 17:42:45,200 INFO  indexer.DeleteDuplicates - Dedup: finished at 2010-09-23 17:42:45, elapsed: 00:00:05

Thanks for all your help
Raj



-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@buyways.nl] 
Sent: Thursday, September 23, 2010 4:52 PM
To: user@nutch.apache.org
Subject: RE: Duplicate URLs

bin/nutch solrdedup
Usage: SolrDeleteDuplicates <solr url>

 

You could also handle deduplication in your Solr configuration. It exposes more options and lets you mark duplicates (documents with identical signatures) or overwrite them (deduplicate).

 

http://wiki.apache.org/solr/Deduplication
 





RE: Duplicate URLs

Posted by Markus Jelsma <ma...@buyways.nl>.
bin/nutch solrdedup
Usage: SolrDeleteDuplicates <solr url>

 

You could also handle deduplication in your Solr configuration. It exposes more options and lets you mark duplicates (documents with identical signatures) or overwrite them (deduplicate).

 

http://wiki.apache.org/solr/Deduplication
 
-----Original message-----
From: Nemani, Raj <Ra...@turner.com>
Sent: Thu 23-09-2010 22:48
To: user@nutch.apache.org; 
Subject: RE: Duplicate URLs

Thanks again.  One final question.  I do not create a Nutch index.  I just push the crawl segments to Solr using the following command line.

bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb crawl/segments/*

Do I need to create a Nutch index to get the dedup going?  I ask because I saw an online script that submits the Nutch index directory to the dedup command.  Can I just pass in the segments directory (as shown in the document from the link you sent) without having to build the Nutch index?

I am going to try both ways in the meantime.

Thanks so much again
Raj







RE: Duplicate URLs

Posted by "Nemani, Raj" <Ra...@turner.com>.
Thanks again.  One final question.  I do not create a Nutch index.  I just push the crawl segments to Solr using the following command line.

bin/nutch solrindex $solr_endpoint crawl/crawldb crawl/linkdb crawl/segments/*

Do I need to create a Nutch index to get the dedup going?  I ask because I saw an online script that submits the Nutch index directory to the dedup command.  Can I just pass in the segments directory (as shown in the document from the link you sent) without having to build the Nutch index?

I am going to try both ways in the meantime.

Thanks so much again
Raj


-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@buyways.nl] 
Sent: Thursday, September 23, 2010 4:33 PM
To: user@nutch.apache.org
Subject: RE: Duplicate URLs

Deduplication is a mechanism where a hash is generated based on the contents of some field (title and/or content, usually). It can be as simple as an MD5 hash or a fuzzier match. Nutch can deduplicate its own index by using that command line option. You can also use Nutch to deduplicate whatever you pushed to a Solr index, and you can configure Solr to deduplicate as well.

 

http://wiki.apache.org/nutch/CommandLineOptions

 


 





RE: Duplicate URLs

Posted by Markus Jelsma <ma...@buyways.nl>.
Deduplication is a mechanism where a hash is generated based on the contents of some field (title and/or content, usually). It can be as simple as an MD5 hash or a fuzzier match. Nutch can deduplicate its own index by using that command line option. You can also use Nutch to deduplicate whatever you pushed to a Solr index, and you can configure Solr to deduplicate as well.

 

http://wiki.apache.org/nutch/CommandLineOptions
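
For instance, deduplicating what was already pushed to Solr is a single call (the
Solr URL here is just a placeholder):

bin/nutch solrdedup http://localhost:8983/solr

while bin/nutch dedup <index dir> is the equivalent for a local Nutch/Lucene index.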

 


 
-----Original message-----
From: Nemani, Raj <Ra...@turner.com>
Sent: Thu 23-09-2010 22:26
To: user@nutch.apache.org; 
Subject: RE: Duplicate URLs

Markus,

Thanks so much.
Any link that outlines the step to take that you can forward or just explain if you can.  I appreciate your help.  I will keep looking online in the meantime.

Thanks
Raj







RE: Duplicate URLs

Posted by "Nemani, Raj" <Ra...@turner.com>.
Markus,

Thanks so much.
If there is a link that outlines the steps to take, please forward it, or just explain if you can.  I appreciate your help.  I will keep looking online in the meantime.

Thanks
Raj


-----Original Message-----
From: Markus Jelsma [mailto:markus.jelsma@buyways.nl] 
Sent: Thursday, September 23, 2010 4:20 PM
To: user@nutch.apache.org
Subject: RE: Duplicate URLs

Use deduplication. 
 





RE: Duplicate URLs

Posted by Markus Jelsma <ma...@buyways.nl>.
Use deduplication. 
 
-----Original message-----
From: Nemani, Raj <Ra...@turner.com>
Sent: Thu 23-09-2010 22:12
To: user@nutch.apache.org; 
Subject: Duplicate URLs

All,



I just wanted to see if there is a way we can tell Nutch to treat the
following URLs as the same.



http://SITENAME.DOMAINNAME.com/research/briefing_books/avian_flu/who_rec_action.htm

http://SITENAME/research/briefing_books/avian_flu/who_rec_action.htm



As you know, you can set up web servers such that both of the URLs above
resolve to the same end point.  In other words, the two URLs are actually
*the same* even though they are physically different.  Is there any way I can
tell Nutch to treat these URLs as the same?

I cannot use filtering to ignore one or the other (either with
DOMAINNAME or without) because I need to allow both patterns in order to
admit genuine URLs.



Thanks

Raj