You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by Christoph Goller <go...@detego-software.de> on 2003/10/25 15:51:18 UTC

Re: subclassing of IndexReader

Hi Doug,

I reviewed your changes for subclassing of IndexReader and stumbled
over the following:

*)SegmentMerger.merge() now delivers the number of documents. Therefore,
they don´t have to counted in IndexWriter.mergeSegments any longer.
I changed this and all unit test still work including TestIndexWriter
which tests exactly this number. So I think this small change is ok
and I will commit it.

*)I am just curious. What is IndexReader.undeleteAll needed for?

*)SegmentsReader.undeleteAll does not set hasDeletions to false.
I think this is a bug. Could you check please.

*)The optimized implementation of SegmentTermDocs.seek(TermEnum enum)
is essential in order to avoid unnecessary seek for termInfo in
SegmentMerger.appendPostings(...). The problem I see is that
SegmentTermDocs.seek(TermEnum enum) is public and there is no test to
assure that enum is from the same segment as SegmentTermDocs. I think
such a test should be added. If you agree, I can do that.

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: subclassing of IndexReader

Posted by Doug Cutting <cu...@lucene.com>.
Christoph Goller wrote:
> That's quite interesting. I am currently involved in a small crawling 
> project. We only crawl a very limited number of news pages, some of them 
> several times per day. We found that there are often tiny changes on 
> these pages (spelling corrections, banner changes) which we would like 
> to ignore
> (classify as dublicate) while we want to recognize bigger changes. For 
> such a setting MD5 keys are
> not very helpful. How do you detect dublicates in Nutch?

Nutch currently only does MD5-based duplicate elimination.  So only 
exact duplicates are eliminated.

There's been a fair amount of work on better methods.  For example, 
there was Broder et. al.'s "Syntactic Clustering" work 
(http://gatekeeper.research.compaq.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html/).

However I've never seen anyone demonstrate how such methods can be 
efficiently applied to huge collections.  Perhaps they can, but it's not 
obvious to me.  I've also not followed this literature closely.

Doug


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: subclassing of IndexReader

Posted by Andrzej Bialecki <ab...@getopt.org>.
Christoph Goller wrote:
> Otis Gospodnetic schrieb:
> 
>> I am also involved in a small project that deals with crawling. :)
>> I have not done this, yet, but have thought about the same problem that
>> you are asking about - detecting small changes in web pages.
>> Have you considered using Nilsimsa?
>>
>> Otis
> 
> 
> Hi Otis,
> 
> sorry for the delay. Due to some "management" decisions the subject of
> dublicate checking no longer has top priority for me. But it will probably
> be of interest again next year.I did not try Nilsimsa so far. Did you?

Nilsimsa, even though on the surface it appears to work reasonably well, 
has been heavily criticized for weak theoretical foundations. See the 
archives of Nilsimsa mailing list for details.

I have yet to find an open source alternative to it, though ...

-- 
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)




---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: subclassing of IndexReader

Posted by Otis Gospodnetic <ot...@yahoo.com>.
> be of interest again next year.I did not try Nilsimsa so far. Did
> you?
> 
> Christoph

No, not yet.  I am not sure if there is a Java port of it yet.  I
intend to take a look at it soon, though.

Otis


__________________________________
Do you Yahoo!?
Protect your identity with Yahoo! Mail AddressGuard
http://antispam.yahoo.com/whatsnewfree

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: subclassing of IndexReader

Posted by Christoph Goller <go...@detego-software.de>.
Otis Gospodnetic schrieb:
> I am also involved in a small project that deals with crawling. :)
> I have not done this, yet, but have thought about the same problem that
> you are asking about - detecting small changes in web pages.
> Have you considered using Nilsimsa?
> 
> Otis

Hi Otis,

sorry for the delay. Due to some "management" decisions the subject of
dublicate checking no longer has top priority for me. But it will probably
be of interest again next year.I did not try Nilsimsa so far. Did you?

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: subclassing of IndexReader

Posted by Otis Gospodnetic <ot...@yahoo.com>.
I am also involved in a small project that deals with crawling. :)
I have not done this, yet, but have thought about the same problem that
you are asking about - detecting small changes in web pages.
Have you considered using Nilsimsa?

Otis


--- Christoph Goller <go...@detego-software.de> wrote:
> Hi Doug
> 
> 
> > 
> >>*)I am just curious. What is IndexReader.undeleteAll needed for?
> > 
> > 
> > In Nutch we have a rotating set of indexes.  For example, we might
> create a new index every day.  
> > Our crawler guarantees that pages will be re-indexed every 30 days,
> so we can, e.g., every day merge 
> > (or search w/o merging) the most recent 30 indexes.  So far so
> good.  But many pages are clones of 
> > other pages: different urls with the same content.  So, each time
> we deploy a new set of indexes we 
> > need to first perform duplicate detection to make sure that, for
> each unique content, only a single 
> > url is present, that with the highest link analysis score.  I
> implement this by first calling 
> > undeleteAll(), then perform the global duplicate detection,
> deleting duplicates from their index.  
> > Does this make sense?  Each day duplicate detection must be
> repeated when a new index is added, but 
> > first all of the previously detected duplicates must be cleared.
> > 
> 
> That's quite interesting. I am currently involved in a small crawling
> project. We only crawl a very 
> limited number of news pages, some of them several times per day. We
> found that there are often 
> tiny changes on these pages (spelling corrections, banner changes)
> which we would like to ignore
> (classify as dublicate) while we want to recognize bigger changes.
> For such a setting MD5 keys are
> not very helpful. How do you detect dublicates in Nutch?
> 
> Christoph
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
> 


__________________________________
Do you Yahoo!?
Exclusive Video Premiere - Britney Spears
http://launch.yahoo.com/promos/britneyspears/

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: subclassing of IndexReader

Posted by Christoph Goller <go...@detego-software.de>.
Hi Doug


> 
>>*)I am just curious. What is IndexReader.undeleteAll needed for?
> 
> 
> In Nutch we have a rotating set of indexes.  For example, we might create a new index every day.  
> Our crawler guarantees that pages will be re-indexed every 30 days, so we can, e.g., every day merge 
> (or search w/o merging) the most recent 30 indexes.  So far so good.  But many pages are clones of 
> other pages: different urls with the same content.  So, each time we deploy a new set of indexes we 
> need to first perform duplicate detection to make sure that, for each unique content, only a single 
> url is present, that with the highest link analysis score.  I implement this by first calling 
> undeleteAll(), then perform the global duplicate detection, deleting duplicates from their index.  
> Does this make sense?  Each day duplicate detection must be repeated when a new index is added, but 
> first all of the previously detected duplicates must be cleared.
> 

That's quite interesting. I am currently involved in a small crawling project. We only crawl a very 
limited number of news pages, some of them several times per day. We found that there are often 
tiny changes on these pages (spelling corrections, banner changes) which we would like to ignore
(classify as dublicate) while we want to recognize bigger changes. For such a setting MD5 keys are
not very helpful. How do you detect dublicates in Nutch?

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: subclassing of IndexReader

Posted by ap...@lucene.com.
>From Christoph Goller <go...@detego-software.de> on 25 Oct 2003:
> I reviewed your changes for subclassing of IndexReader

Thank you very much!  This makes me more comfortable with the changes.

> *)SegmentMerger.merge() now delivers the number of documents. Therefore,
> they don�t have to counted in IndexWriter.mergeSegments any longer.
> I changed this and all unit test still work including TestIndexWriter
> which tests exactly this number. So I think this small change is ok
> and I will commit it.

Sounds good to me.

> *)I am just curious. What is IndexReader.undeleteAll needed for?

In Nutch we have a rotating set of indexes.  For example, we might create a new index every day.  Our crawler guarantees that pages will be re-indexed every 30 days, so we can, e.g., every day merge (or search w/o merging) the most recent 30 indexes.  So far so good.  But many pages are clones of other pages: different urls with the same content.  So, each time we deploy a new set of indexes we need to first perform duplicate detection to make sure that, for each unique content, only a single url is present, that with the highest link analysis score.  I implement this by first calling undeleteAll(), then perform the global duplicate detection, deleting duplicates from their index.  Does this make sense?  Each day duplicate detection must be repeated when a new index is added, but first all of the previously detected duplicates must be cleared.

> *)SegmentsReader.undeleteAll does not set hasDeletions to false.
> I think this is a bug. Could you check please.

It indeed sounds like a bug.  I am on the road this week, reading email on a borrowed machine, and cannot check this right now.  Thanks for catching this!

> *)The optimized implementation of SegmentTermDocs.seek(TermEnum enum)
> is essential in order to avoid unnecessary seek for termInfo in
> SegmentMerger.appendPostings(...).

You really did review this well!  That was the only tricky thing about this change, required to make it perform well.  I'm impressed that you noticed it.

> The problem I see is that
> SegmentTermDocs.seek(TermEnum enum) is public and there is no test to
> assure that enum is from the same segment as SegmentTermDocs. I think
> such a test should be added. If you agree, I can do that.

That sounds like a good idea.  I am not good at error checking...

Thanks again for your detailed review.  The fixes you suggest all sound good to me.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org