Posted to solr-user@lucene.apache.org by eks dev <ek...@yahoo.co.uk> on 2011/06/28 22:31:27 UTC

conditionally update document on unique id

Quick question:
Is there a way in Solr to conditionally update a document on its unique
id? That is, the default add behavior if the id is not already in the
index, and "do not touch the index" if it is already there.

Deletes are not important (no sync issues).

I am asking because I noticed that, with deduplication turned on,
index files get modified even if I update the same documents again
(same signatures).
I am facing a very high dupe rate (40-50%), and the setup is going to be
master-slave with a high commit rate (the requirement is to reduce
propagation latency for updates). Unnecessary index modifications
are going to waste effort shipping the same information
again and again.

If there is no standard way, what would be the fastest way to check
whether a Term exists in the index from an UpdateRequestProcessor?

I intend to extend SignatureUpdateProcessor to prevent a document from
propagating down the chain if it does.
Would that be a way to deal with it? I repeat, there are no deletes to
cause headaches with synchronization.


Thanks,
eks

Re: conditionally update document on unique id

Posted by Erick Erickson <er...@gmail.com>.
Note that the original question was about working from a custom update request
processor; you're not doing that at all...

It's not clear to me whether querying ahead of time like you're doing is
faster or slower than a custom update request processor, given that in
the one case you're issuing a lot of queries, and in the other you're transmitting
data across the wire and then throwing lots of it away. I guess it depends on
whether the file is sent across before a custom update processor decides
whether to add it (hint, hint, someone who knows can
answer here <G>). It seems like it would have to be, in order to get the
document ID out of the doc in the first place, at least in the XML case....

Anyway, your query could probably be made more efficient, just in terms
of issuing fewer search requests, if you asked
for a bunch of IDs at once, something like:

http://solr/select?q=id:(myid.doc1 OR myid.doc2 OR myid.doc3)&rows=3&fl=id

and you'd have to look inside the results for IDs if the result count > 0,
but I don't know whether it's enough faster to matter.
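
A minimal sketch of that batched lookup, in Java and using only the standard
library. The class and method names (BatchedIdLookup, buildUrl, parseIds,
missing) and the http://solr base URL are placeholders, not part of any Solr
client API. It assumes the default XML response format and an untokenized
string id field; quoting each id is an extra precaution that is not in the
URL above.

```java
import java.io.StringReader;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class BatchedIdLookup {

    // Builds one select URL that asks for all the given ids at once.
    static String buildUrl(String baseUrl, List<String> ids) {
        StringBuilder q = new StringBuilder("id:(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) {
                q.append(" OR ");
            }
            // Quote each id so characters in it are not read as query syntax.
            String escaped = ids.get(i).replace("\\", "\\\\").replace("\"", "\\\"");
            q.append('"').append(escaped).append('"');
        }
        q.append(')');
        String encoded = URLEncoder.encode(q.toString(), StandardCharsets.UTF_8);
        return baseUrl + "/select?q=" + encoded + "&rows=" + ids.size() + "&fl=id";
    }

    // Collects the id values found inside <doc> elements of a Solr XML response.
    static Set<String> parseIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList strs = doc.getElementsByTagName("str");
            Set<String> ids = new LinkedHashSet<>();
            for (int i = 0; i < strs.getLength(); i++) {
                Element e = (Element) strs.item(i);
                boolean insideDoc = "doc".equals(e.getParentNode().getNodeName());
                if (insideDoc && "id".equals(e.getAttribute("name"))) {
                    ids.add(e.getTextContent());
                }
            }
            return ids;
        } catch (Exception ex) {
            throw new IllegalArgumentException("not a parsable XML response", ex);
        }
    }

    // Returns the requested ids that the index did not report back.
    static List<String> missing(List<String> requested, Set<String> found) {
        List<String> result = new ArrayList<>();
        for (String id : requested) {
            if (!found.contains(id)) {
                result.add(id);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("myid.doc1", "myid.doc2", "myid.doc3");
        System.out.println(buildUrl("http://solr", ids));
        // Hand-written sample response in which only myid.doc2 is already indexed.
        String xml = "<response><result name=\"response\" numFound=\"1\" start=\"0\">"
                + "<doc><str name=\"id\">myid.doc2</str></doc></result></response>";
        System.out.println(missing(ids, parseIds(xml))); // prints [myid.doc1, myid.doc3]
    }
}
```

Only the ids returned by missing(...) would then need uploading. Very large
batches may run into the maxBooleanClauses limit (1024 by default in
solrconfig.xml), so the batch size is worth capping.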

Best
Erick

Re: conditionally update document on unique id

Posted by "chadsteele.com" <ch...@chadsteele.com>.
Oops... the query looks more like this:

http://solr/select?q=id:myid.doc&rows=0

--
View this message in context: http://lucene.472066.n3.nabble.com/conditionally-update-document-on-unique-id-tp3119302p3543871.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: conditionally update document on unique id

Posted by "chadsteele.com" <ch...@chadsteele.com>.
I wanted something similar for a file crawler/uploader in C#, but I don't even
want to upload the document if it exists... I'm currently querying Solr
first... Is this optimal, silly, or otherwise?

var url = "http://solr/select?q=myid.doc&rows=0";
var txt = webclient.DownloadString(url);

if (txt.Contains("numFound=\"0\""))
{
    // upload the file
}
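
Matching the literal substring numFound="0" depends on the exact serialization
of the response. A slightly sturdier variant reads the attribute itself. The
sketch below is in Java rather than C#, uses only the standard library, and
assumes the default XML response format; NumFoundCheck, numFound and
shouldUpload are hypothetical names, not Solr API.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NumFoundCheck {

    // Reads numFound from the <result name="response"> element of a Solr XML response.
    static long numFound(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception ex) {
            throw new IllegalArgumentException("not a parsable XML response", ex);
        }
        NodeList results = doc.getElementsByTagName("result");
        for (int i = 0; i < results.getLength(); i++) {
            Element r = (Element) results.item(i);
            if ("response".equals(r.getAttribute("name"))) {
                return Long.parseLong(r.getAttribute("numFound"));
            }
        }
        throw new IllegalArgumentException("no <result name=\"response\"> element found");
    }

    // True when the query matched nothing, i.e. the file still needs uploading.
    static boolean shouldUpload(String xml) {
        return numFound(xml) == 0;
    }

    public static void main(String[] args) {
        // Hand-written sample response for a query that matched no documents.
        String xml = "<response><result name=\"response\" numFound=\"0\" start=\"0\"/></response>";
        System.out.println(shouldUpload(xml)); // prints true
    }
}
```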

Re: conditionally update document on unique id

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Thu, Jun 30, 2011 at 2:06 AM, Yonik Seeley <yo...@lucidimagination.com> wrote:

> On Wed, Jun 29, 2011 at 4:32 PM, eks dev <ek...@googlemail.com> wrote:
> > req.getSearcher().getFirstMatch(t) != -1;
>
> Yep, this is currently the fastest option we have.
>
>
Just for my understanding: this method won't use any caches, but may still be
faster across repeated runs for the same token? I'm asking because Eks said
that they have 40-50% duplicate documents.

-- 
Regards,
Shalin Shekhar Mangar.

Re: conditionally update document on unique id

Posted by eks dev <ek...@googlemail.com>.
Hi Yonik,
As this recommendation comes from you, I am not going to test it; you
are well known as a speed junkie ;)

While we are there (in SignatureUpdateProcessor), why is this code not
moved to the constructor, instead of remaining in processAdd?

...
        Signature sig = (Signature)
            req.getCore().getResourceLoader().newInstance(signatureClass);
        sig.init(params);
...
Should we expect on-the-fly changes to signatureClass / params? I
am still not all that familiar with Solr life cycles... this might be
a stupid question.

Thanks,
eks


Re: conditionally update document on unique id

Posted by Yonik Seeley <yo...@lucidimagination.com>.
On Wed, Jun 29, 2011 at 4:32 PM, eks dev <ek...@googlemail.com> wrote:
> req.getSearcher().getFirstMatch(t) != -1;

Yep, this is currently the fastest option we have.

-Yonik
http://www.lucidimagination.com

Re: conditionally update document on unique id

Posted by eks dev <ek...@googlemail.com>.
Thanks Shalin!

Would you not expect

req.getSearcher().docFreq(t);

to be slightly faster? Or maybe even

req.getSearcher().getFirstMatch(t) != -1;

Which one should be faster, and are there any known side effects?

Re: conditionally update document on unique id

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.
On Wed, Jun 29, 2011 at 2:01 AM, eks dev <ek...@yahoo.co.uk> wrote:

> Quick question,
> Is there a way with solr to conditionally update document on unique
> id? Meaning, default, add behavior if id is not already in index and
> "not to touch index" if already there.
>
> Deletes are not important (no sync issues).
>
> I am asking because I noticed with deduplication turned on,
> index-files get modified even if I update the same documents again
> (same signatures).
> I am facing very high dupes rate (40-50%), and setup is going to be
> master-slave with high commit rate (requirement is to reduce
> propagation latency for updates). Having unnecessary index
> modifications is going to waste  "effort" to ship the same information
> again and again.
>
> if there is no standard way, what would be the fastest way to check if
> Term exists in index from UpdateRequestProcessor?
>
>
I'd suggest that you use the searcher's getDocSet with a TermQuery.

Use SolrQueryRequest#getSearcher so you don't need to worry about ref
counting.

e.g.

req.getSearcher().getDocSet(new TermQuery(new Term(signatureField, sigString))).size();



> I intend to extend SignatureUpdateProcessor to prevent a document from
> propagating down the chain if this happens?
> Would that be a way to deal with it? I repeat, there are no deletes to
> make headaches with synchronization
>
>
Yes, that should be fine.
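
The control flow discussed here (check whether the term exists and, if so, do
not pass the document on) can be modelled in plain Java without Solr on the
classpath. Everything below is a hypothetical stand-in: Processor mimics an
update processor, the Predicate stands in for a lookup such as
req.getSearcher().getFirstMatch(t) != -1, and the pending set is an added
assumption to cover ids that were forwarded but may not yet be visible to the
current searcher. It is a sketch of the idea, not Solr code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class SkipExistingModel {

    // Stand-in for an update processor: handles one document id, may pass it on.
    interface Processor {
        void processAdd(String id);
    }

    // Last link in the chain: records what would actually be written to the index.
    static class IndexingProcessor implements Processor {
        final List<String> written = new ArrayList<>();

        @Override
        public void processAdd(String id) {
            written.add(id);
        }
    }

    // Forwards a document only if its id is neither in the index nor already pending.
    static class SkipExistingProcessor implements Processor {
        private final Predicate<String> existsInIndex; // stand-in for the index lookup
        private final Set<String> pending = new HashSet<>(); // ids forwarded so far
        private final Processor next;

        SkipExistingProcessor(Predicate<String> existsInIndex, Processor next) {
            this.existsInIndex = existsInIndex;
            this.next = next;
        }

        @Override
        public void processAdd(String id) {
            // pending.add returns false when the id was already forwarded earlier.
            if (existsInIndex.test(id) || !pending.add(id)) {
                return; // duplicate: stop here, nothing reaches the index
            }
            next.processAdd(id);
        }
    }

    public static void main(String[] args) {
        Set<String> committed = new HashSet<>(List.of("sig-1", "sig-2"));
        IndexingProcessor index = new IndexingProcessor();
        SkipExistingProcessor skip = new SkipExistingProcessor(committed::contains, index);
        for (String id : List.of("sig-1", "sig-3", "sig-3", "sig-4")) {
            skip.processAdd(id);
        }
        System.out.println(index.written); // prints [sig-3, sig-4]
    }
}
```

In a real processor the same decision would sit in processAdd, calling
super.processAdd(cmd) only for documents that pass the check.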

-- 
Regards,
Shalin Shekhar Mangar.