Posted to solr-dev@lucene.apache.org by climbingrose <cl...@gmail.com> on 2007/09/24 12:48:50 UTC

Implication of not calling closeSearcher() in DirectUpdateHandler2?

I notice that in DUH2, every time a new document is added, the searcher is
closed. I'm modifying the DUH2 source code to enable a custom dedup process
that requires access to the index. Obviously, closing and opening the index
for every add/update is expensive. Therefore, I have temporarily commented
out the closeSearcher() line in the addDoc() method:

        if (rc == 1) {
          // adding document -- prep writer
          //closeSearcher();
          openWriter();
          tracker.addedDocument();          
        } else {
          // exit prematurely
          return rc;
        }

Everything seems to be working so far. However, I don't yet understand the
implications of doing this. Can anyone explain?

Thanks,

Regards,
Cuong Hoang
-- 
View this message in context: http://www.nabble.com/Implication-of-not-calling-closeSearcher%28%29-in-DirectUpdateHandler2--tf4508411.html#a12857591
Sent from the Solr - Dev mailing list archive at Nabble.com.

Near duplicate detection [was: Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?]

Posted by Steven Rowe <sa...@syr.edu>.

Hi,

Cuong Hoang wrote:
> BTW, has anyone here done any serious near duplication detection with Solr?
> If yes, what approaches did you use?
[...]
> Unfortunately some of our documents are "near duplications" which means they
> are mostly identical (>75%) but usually not 100% identical. hashCode is very
> sensitive to small changes so it can't be used in our case. 

You may be interested in this Lucene java-user ML thread:

<http://www.gossamer-threads.com/lists/lucene/java-user/41103>

The Nutch TextProfileSignature implementation[1] mentioned in the
above-linked thread appears to take an MD5 signature of the
frequency-ordered, downcased, whitespace-separated tokens from a document.
This approach is not quite as sensitive to small changes as a direct
hash of the content, but it will likely fail fairly often if you're
looking at differences of more than a few percent (as your ">75%
identical" seems to indicate).
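
A minimal sketch of the idea described above, written in plain Java. It is
not the Nutch code: the class and method names are hypothetical, and the real
TextProfileSignature may apply further normalization that is not shown here.
It lowercases the text, splits on whitespace, orders the distinct tokens by
descending frequency, and takes an MD5 digest of that ordered list.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration; not the Nutch TextProfileSignature class.
final class SimpleProfileSignature {

    /** Returns a hex MD5 digest of the distinct tokens, most frequent first. */
    static String signature(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Most frequent first; ties are broken alphabetically so the order is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        StringBuilder profile = new StringBuilder();
        for (Map.Entry<String, Integer> entry : entries) {
            profile.append(entry.getKey()).append(' ');
        }

        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(profile.toString().getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

In this simplified version, reordering words or changing their case leaves
the signature unchanged, but a single new token changes it, which is why a
signature of this kind struggles once documents differ by more than a few
percent.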

I have done some small-scale deduplication work (without Solr), and
found that a small preprocessing step using regular expressions to
remove changeable content that was not meaningful for the purposes of
comparison (e.g. hit counters and date/time stamps) was fairly
successful in reducing the error rate for a brute-force term frequency
comparison approach (i.e., direct calculation of the angle between doc
pairs' term vectors).
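
A rough sketch of that kind of comparison, for illustration only. The class
name, the regular expression for "changeable content" (ISO-style dates, clock
times, and a "Hits: N" counter), and the tokenization are all assumptions,
not the code used in the work described above. The method returns the cosine
of the angle between two term-frequency vectors; the angle itself would be
Math.acos of that value.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of a brute-force term-frequency comparison.
final class TermVectorSimilarity {

    // Assumed examples of changeable content: dates, clock times, hit counters.
    private static final Pattern VOLATILE = Pattern.compile(
        "\\d{4}-\\d{2}-\\d{2}|\\d{1,2}:\\d{2}(?::\\d{2})?|hits:\\s*\\d+",
        Pattern.CASE_INSENSITIVE);

    /** Strips the volatile content, then counts lowercase word tokens. */
    static Map<String, Integer> termFrequencies(String text) {
        String cleaned = VOLATILE.matcher(text).replaceAll(" ");
        Map<String, Integer> tf = new HashMap<>();
        for (String token : cleaned.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    /** Cosine of the angle between two term-frequency vectors (0.0 if either is empty). */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> entry : a.entrySet()) {
            double va = entry.getValue();
            normA += va * va;
            Integer vb = b.get(entry.getKey());
            if (vb != null) {
                dot += va * vb;
            }
        }
        for (int vb : b.values()) {
            normB += (double) vb * vb;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Two pages that differ only in their timestamp and hit counter score close to
1.0 after preprocessing, while pages with no terms in common score 0.0. A
threshold for "near duplicate" would have to be tuned on real data.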

Steve

[1] API doc for Nutch TextProfileSignature class:
<http://lucene.apache.org/nutch/apidocs/org/apache/nutch/crawl/TextProfileSignature.html>

Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?

Posted by Chris Hostetter <ho...@fucit.org>.

: I'm doing near duplication detection on a fairly large number of documents.
: Each document to be added to Solr will be compared with sample documents
: from all clusters in the index. I could of course, dedupe documents at
: client side but the performance will not be as good.

have you considered the UpdateRequestProcessor API as an alternative to 
mucking with DUH2 directly?

http://lucene.apache.org/solr/api/org/apache/solr/update/processor/package-summary.html

(i don't really know much of the details about it, but i know it was added 
specifically to support more "biz logic" related tasks at update time -- as 
opposed to the really low level nitty gritty updating that DUH2 worries 
about directly)



-Hoss

Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?

Posted by climbingrose <cl...@gmail.com>.

Thanks Walter, 

Unfortunately some of our documents are "near duplications" which means they
are mostly identical (>75%) but usually not 100% identical. hashCode is very
sensitive to small changes so it can't be used in our case. 




Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?

Posted by Walter Ferrara <wa...@gmail.com>.

Solr has unique keys, which do that "avoid duplicate" work for you, so
you may try to make some kind of unique identifier out of the text you're
going to index, and use that as a Solr <uniqueKey>.

You could try to create a sort of hashCode or something like that from
the text you are going to index, and use that as the uniqueKey of the
schema -  the next time you're going to add the same text, you should
get the same key, and so Solr will not add it again, but just update it
(or at least it will be a lot simpler to understand if that document is
already present in the index).
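
One way to sketch the suggestion above: derive the key from a digest of the
text, so identical text always maps to the same uniqueKey value. The class
name is hypothetical, and both the whitespace/case normalization and the
choice of SHA-1 are assumptions made for this example rather than anything
Solr prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper for building a content-derived key on the client side.
final class ContentKey {

    /** Returns a hex SHA-1 digest of the text after light normalization. */
    static String of(String text) {
        // Assumed normalization: trim, lowercase, collapse runs of whitespace.
        String normalized = text.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", " ");
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }
}
```

The resulting string would be sent as the value of the uniqueKey field. Note
that any change to the text beyond the normalization produces a different
key, so this only catches exact duplicates.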

any other thoughts?
--
Walter


Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?

Posted by climbingrose <cl...@gmail.com>.


>>You would get autowarming, etc, by default though - not what you want
>>from a searcher that is  only used for deletions.

As a workaround, I manually initialise an LRUCache instance in the DUH2
constructor. It works, but it's not very elegant because you can't view the
cache's statistics in the Solr admin...

>>What problem are you trying to solve that requires directly using or
>>modifying DUH2?

I'm doing near duplication detection on a fairly large number of documents.
Each document to be added to Solr will be compared with sample documents
from all clusters in the index. I could, of course, dedupe documents on the
client side, but the performance would not be as good.

BTW, has anyone here done any serious near duplication detection with Solr?
If yes, what approaches did you use?

Thanks.

Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/24/07, climbingrose <cl...@gmail.com> wrote:
> Thanks for the clarifications, Yonik. The other thing I notice is
> openSearcher() only creates a SolrIndexSearcher without cache (useCache =
> false). Therefore, when I try to use generic cache defined in
> solrconfig.xml, SolrIndexSearcher.getCache() method returns null. Would it
> be OK to turn on searcher cache?

You would get autowarming, etc, by default though - not what you want
from a searcher that is  only used for deletions.

What problem are you trying to solve that requires directly using or
modifying DUH2?

-Yonik

Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?

Posted by climbingrose <cl...@gmail.com>.

Thanks for the clarifications, Yonik. The other thing I notice is that
openSearcher() only creates a SolrIndexSearcher without a cache (useCache =
false). Therefore, when I try to use a generic cache defined in
solrconfig.xml, the SolrIndexSearcher.getCache() method returns null. Would
it be OK to turn on the searcher cache?




Re: Implication of not calling closeSearcher() in DirectUpdateHandler2?

Posted by Yonik Seeley <yo...@apache.org>.

On 9/24/07, climbingrose <cl...@gmail.com> wrote:
> I notice that in DUH2, everytime a new document is added, the searcher is
> closed.

Keep in mind that for the version you are looking at, the open
searcher is used *only* for deleting docs... hence if it is open, then
there are pending deletes to be flushed and it must be closed before
the IndexWriter is opened.

This is done lazily so that back-to-back delete-by-query commands
don't have to keep opening and closing the searcher.  For back-to-back
adds, the searcher will remain closed and the writer will remain open,
so nothing is being open+closed per add.
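
The lazy behaviour described above can be modelled with a toy class. This is
not the DirectUpdateHandler2 source: every name here is hypothetical, the
real open/close calls are replaced by counters, and closing the writer before
a delete-by-query is an assumption made for symmetry.

```java
// Toy model only; names do not correspond to real DirectUpdateHandler2 members.
final class LazyHandlerModel {
    private boolean searcherOpen;
    private boolean writerOpen;
    int searcherOpens;  // how many times a searcher had to be opened
    int writerOpens;    // how many times a writer had to be opened

    /** Delete-by-query needs the searcher; it is opened only if not already open. */
    void deleteByQuery() {
        writerOpen = false;          // assumed: the writer is closed first
        if (!searcherOpen) {
            searcherOpen = true;
            searcherOpens++;
        }
    }

    /** Adding needs the writer; an open searcher means pending deletes to flush. */
    void addDoc() {
        searcherOpen = false;        // no-op when the searcher is already closed
        if (!writerOpen) {
            writerOpen = true;
            writerOpens++;
        }
    }
}
```

In this model, a run of consecutive adds opens the writer once and a run of
consecutive deletes opens the searcher once; a reopen only happens when the
two kinds of command alternate.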

-Yonik