Posted to user@nutch.apache.org by Lucifersam <ro...@tagish.co.uk> on 2007/02/21 17:46:46 UTC

Quick questions - merging/deduping

Hi,

I'm new to Nutch and am trying to get my head around some basics... I need
to index two sites, one of which is under my control, into a single search
index.

I have run a complete 'seed' crawl over the first site, which is under my
control, and would like to update its index daily. To avoid recrawling the
whole site I have set up a 'what's new/changed' page which I want to crawl
daily to pick up any changes, then merge the result with the complete crawl
to produce an up-to-date index. (I tried the recrawl script from the wiki,
but it didn't seem to do what I wanted.)

I have merged the two indexes in the following way:

- created a new directory mergedcrawl
- copied seedcrawl/indexes/part-00000/* to mergedcrawl/indexes/part-00000
- copied changedcrawl/indexes/part-00000/* to mergedcrawl/indexes/part-00001
- ran 'bin/nutch dedup indexes' on mergedcrawl
- ran 'bin/nutch merge index indexes' on mergedcrawl
- copied /segments/* from both crawls into the mergedcrawl
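
In shell terms, what I did boils down to roughly this (run from the Nutch
installation directory; the directory names are my own, and the exact
dedup/merge argument order is from memory - check the commands' usage
output if in doubt):

  mkdir -p mergedcrawl/indexes mergedcrawl/segments
  cp -r seedcrawl/indexes/part-00000    mergedcrawl/indexes/part-00000
  cp -r changedcrawl/indexes/part-00000 mergedcrawl/indexes/part-00001
  cp -r seedcrawl/segments/*    mergedcrawl/segments/
  cp -r changedcrawl/segments/* mergedcrawl/segments/
  # remove duplicate documents across the two index parts
  bin/nutch dedup mergedcrawl/indexes
  # merge the parts into a single 'index' directory
  bin/nutch merge mergedcrawl/index mergedcrawl/indexes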

With searcher.dir pointed at the new directory, the search seems to return
results from both indexes successfully. Is this the correct way to do this?
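
Concretely, that is just the searcher.dir property in nutch-site.xml (the
value below is a placeholder for wherever mergedcrawl lives):

  <property>
    <name>searcher.dir</name>
    <value>/path/to/mergedcrawl</value>
  </property>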

The second site is not under my control, so I need to find an alternative
way to keep the index up to date. Am I correct in thinking that simply
recrawling the whole site is the easiest way to do this - or is there a way
to index only modified pages?

Finally - I seem to have a problem with identical pages at different URLs,
e.g.:

http://website/
http://website/default.htm

I was under the impression that these would be removed by the dedup process,
but this does not seem to be working. Is there something I'm missing? (I
also have a similar problem with the external site, as it carries
ever-changing session IDs in its URLs, although the content of the duplicate
pages is identical.)

Sorry for the long post - any help is appreciated!


Re: Quick questions - merging/deduping

Posted by Andrzej Bialecki <ab...@getopt.org>.
Lucifersam wrote:
> Thanks for the suggestions - I will look into this.
>
> Any comments/suggestions regarding the methods I am using to keep the index
> up to date?
>
>   

I can't see anything wrong with it on the conceptual level - in the end the 
indexes are deduped and merged, so it should work fine.


-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Quick questions - merging/deduping

Posted by Lucifersam <ro...@tagish.co.uk>.


Andrzej Bialecki wrote:
> 
> Lucifersam wrote:
>> Andrzej Bialecki wrote:
>>   
>>> Lucifersam wrote:
>>>     
>>>> Finally - I seem to have a problem with identical pages at different
>>>> URLs, e.g.:
>>>>
>>>> http://website/
>>>> http://website/default.htm
>>>>
>>>> I was under the impression that these would be removed by the dedup
>>>> process,
>>>> but this does not seem to be working. Is there something I'm missing? 
>>>>       
>>> Most likely the pages are slightly different - you can save them to 
>>> files, and then run a diff utility to check for differences.
>>>
>>>     
>>
>> You're right, there was a small difference in the HTML - a timing
>> comment, e.g.:
>>
>> <!--Exec time = 265.625-->
>>
>> As this is not strictly content - is there a simple way to ignore
>> anything within comments when looking at the content of a page?
>>   
> 
> 
> You can provide your own implementation of a Signature - please see the 
> javadocs for this class - and then set this class in nutch-site.xml.
> 
> A common trick is to use just the plain text version of the page, and 
> further "normalize" it by replacing all whitespace with single spaces, 
> lowercasing all tokens, optionally filtering out all digits, and 
> optionally removing all words that occur only once.
> 
> 

Thanks for the suggestions - I will look into this.

Any comments/suggestions regarding the methods I am using to keep the index
up to date?



Re: Quick questions - merging/deduping

Posted by Andrzej Bialecki <ab...@getopt.org>.
Lucifersam wrote:
> Andrzej Bialecki wrote:
>   
>> Lucifersam wrote:
>>     
>>> Finally - I seem to have a problem with identical pages at different
>>> URLs, e.g.:
>>>
>>> http://website/
>>> http://website/default.htm
>>>
>>> I was under the impression that these would be removed by the dedup
>>> process,
>>> but this does not seem to be working. Is there something I'm missing? 
>>>       
>> Most likely the pages are slightly different - you can save them to 
>> files, and then run a diff utility to check for differences.
>>
>>     
>
> You're right, there was a small difference in the HTML - a timing
> comment, e.g.:
>
> <!--Exec time = 265.625-->
>
> As this is not strictly content - is there a simple way to ignore anything
> within comments when looking at the content of a page?
>   


You can provide your own implementation of a Signature - please see the 
javadocs for this class - and then set this class in nutch-site.xml.

A common trick is to use just the plain text version of the page, and 
further "normalize" it by replacing all whitespace with single spaces, 
lowercasing all tokens, optionally filtering out all digits, and optionally 
removing all words that occur only once.
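
Something along these lines, for illustration (untested, and the class and
package names below are just placeholders - see the Signature javadocs for
the exact contract):

  package org.example.nutch;

  import org.apache.hadoop.io.MD5Hash;
  import org.apache.nutch.crawl.Signature;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;

  // Hash the parsed plain text (HTML comments never reach it), lowercased
  // and with runs of whitespace collapsed to single spaces, so cosmetic
  // markup changes do not change the signature.
  public class NormalizedTextSignature extends Signature {
    public byte[] calculate(Content content, Parse parse) {
      String text = parse.getText()
          .toLowerCase()
          .replaceAll("\\s+", " ")
          .trim();
      return MD5Hash.digest(text).getDigest();
    }
  }

The class is then selected through the db.signature.class property in
nutch-site.xml.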

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Quick questions - merging/deduping

Posted by Lucifersam <ro...@tagish.co.uk>.

Andrzej Bialecki wrote:
> 
> Lucifersam wrote:
>> Finally - I seem to have a problem with identical pages at different
>> URLs, e.g.:
>>
>> http://website/
>> http://website/default.htm
>>
>> I was under the impression that these would be removed by the dedup
>> process,
>> but this does not seem to be working. Is there something I'm missing? 
> 
> Most likely the pages are slightly different - you can save them to 
> files, and then run a diff utility to check for differences.
> 

You're right, there was a small difference in the HTML - a timing comment,
e.g.:

<!--Exec time = 265.625-->

As this is not strictly content - is there a simple way to ignore anything
within comments when looking at the content of a page?


Andrzej Bialecki wrote:
> 
>> (I also have a similar problem with the external site, as it carries
>> ever-changing session IDs in its URLs, although the content of the
>> duplicate pages is identical.)
>>   
> 
> You can remove session IDs using URLNormalizers - see e.g. 
> regex-urlnormalizer.xml for an example of how to do this.
> 

Thanks - I will look into this.



Re: Quick questions - merging/deduping

Posted by Andrzej Bialecki <ab...@getopt.org>.
Lucifersam wrote:
> Finally - I seem to have a problem with identical pages at different URLs,
> e.g.:
>
> http://website/
> http://website/default.htm
>
> I was under the impression that these would be removed by the dedup process,
> but this does not seem to be working. Is there something I'm missing? 

Most likely the pages are slightly different - you can save them to 
files, and then run a diff utility to check for differences.
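
For example, using the two URLs from your post:

  wget -O a.html http://website/
  wget -O b.html http://website/default.htm
  diff a.html b.html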


> (I also have a similar problem with the external site, as it carries
> ever-changing session IDs in its URLs, although the content of the
> duplicate pages is identical.)
>   

You can remove session IDs using URLNormalizers - see e.g. 
regex-urlnormalizer.xml for an example of how to do this.
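
For instance, a rule along these lines (illustrative only - 'jsessionid' is
just one common parameter name, so adjust the pattern to whatever the site
actually appends):

  <regex-normalize>
    <regex>
      <!-- strip a ;jsessionid=... path parameter -->
      <pattern>(?i);jsessionid=[a-z0-9]*</pattern>
      <substitution></substitution>
    </regex>
  </regex-normalize>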

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com