Posted to solr-user@lucene.apache.org by gwk <gi...@eyefi.nl> on 2009/02/23 14:29:18 UTC

Distributed Search

Hello,

The wiki states 'When duplicate doc IDs are received, Solr chooses the 
first doc and discards subsequent ones'. I was wondering whether "the 
first doc" is the doc from the shard which responds first or the doc 
from the first shard listed in the shards GET parameter?

Regards,

gwk
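
For readers unfamiliar with the setup, a distributed request of the kind
asked about might be built roughly like this. It is only a sketch: the host
names, the /solr/select path and the query are made-up placeholders, and
only the shards parameter itself comes from the question.

```python
from urllib.parse import urlencode

# Hypothetical shard addresses; "shards" is the GET parameter the question refers to.
shards = ["fresh-host:8983/solr", "archive-host:8983/solr"]

params = {
    "q": "title:solr",           # placeholder query
    "shards": ",".join(shards),  # comma-separated list, in the order written here
}

# The request goes to one Solr instance, which queries the listed shards.
url = "http://fresh-host:8983/solr/select?" + urlencode(params)
print(url)
```

The question is whether the position of a shard in that list has any
influence on which copy of a duplicate document is kept.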

Re: Distributed Search

Posted by Chris Hostetter <ho...@fucit.org>.

: > Yes, that's the standard trick. :)

: > > Ok, so it wouldn't be possible to have a smaller, faster authoritative
: > > shard for near-real-time updates while keeping the entire dataset in a
: > > second shard which is updated less frequently?

: Ok, now I'm confused, if the shard the document comes from is
: non-deterministic, how can you use this 'trick'? (except that since the

I believe Otis's point is that many people use distributed search across 
shards where some are large and mostly static and one is small and 
frequently updated with new docs, in order to get some performance 
advantage out of the long cache lifetimes on the larger shard(s). But this 
typically works best when you only "add" new docs and don't modify old 
ones (or only modify docs added very recently, so they're always in the 
small shard), while the bigger shards are treated as "archives" that don't 
change.

To be deterministic you can't have the same uniqueKey in multiple shards.




-Hoss
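
One way to check the condition above (no uniqueKey in more than one shard)
is to compare the ID lists of the shards offline. The following is a minimal
sketch under the assumption that the IDs of each shard have already been
exported somehow; the function name, shard names and IDs are all made up
for illustration.

```python
from collections import defaultdict


def find_duplicate_keys(ids_by_shard):
    """Return {uniqueKey: sorted shard names} for keys found in more than one shard."""
    shards_by_id = defaultdict(set)
    for shard, ids in ids_by_shard.items():
        for doc_id in ids:
            shards_by_id[doc_id].add(shard)
    return {
        doc_id: sorted(shards)
        for doc_id, shards in shards_by_id.items()
        if len(shards) > 1
    }


# Made-up data: "doc7" was modified and re-added to the small shard
# without being deleted from the archive shard.
ids_by_shard = {
    "small": ["doc7", "doc8", "doc9"],
    "archive": ["doc1", "doc2", "doc7"],
}
duplicates = find_duplicate_keys(ids_by_shard)
print(duplicates)  # {'doc7': ['archive', 'small']}
```

An empty result means every uniqueKey lives in exactly one shard, which is
the situation in which the question of "which copy wins" never arises.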

Re: Distributed Search

Posted by Mark Miller <ma...@gmail.com>.

Fair enough. We should update the Wiki then? I think it currently does 
read as if it's a supported feature rather than something you should avoid.

-- 
- Mark

http://www.lucidimagination.com



Yonik Seeley wrote:
> On Wed, Feb 25, 2009 at 11:52 AM, Mark Miller <ma...@gmail.com> wrote:
>   
>> 'You are not supposed to have duplicates' is a bit strong - I was reading
>> too much into something Yonik had mentioned in the past. It looks like it's
>> supposed to become more useful:
>>     
>
> Well, perhaps slightly more deterministic so that two queries return
> the same results.
> I think we should stick with the position that duplicate docs in
> shards is an error, but that we handle it gracefully w/o blowing up.
> Things like facet counts, paging, etc, will be slightly off.
>
> -Yonik
> Lucene/Solr? http://www.lucidimagination.com

Re: Distributed Search

Posted by Yonik Seeley <ys...@gmail.com>.

On Wed, Feb 25, 2009 at 11:52 AM, Mark Miller <ma...@gmail.com> wrote:
> 'You are not supposed to have duplicates' is a bit strong - I was reading
> too much into something Yonik had mentioned in the past. It looks like it's
> supposed to become more useful:

Well, perhaps slightly more deterministic so that two queries return
the same results.
I think we should stick with the position that having duplicate docs in
shards is an error, but that we handle it gracefully w/o blowing up.
Things like facet counts, paging, etc., will be slightly off.

-Yonik
Lucene/Solr? http://www.lucidimagination.com
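
To illustrate why counts drift when a document lives in two shards, here is
a small sketch. It assumes the merging node simply adds up the hit counts
and facet counts each shard reports; the shard names, IDs and facet values
are invented for the example.

```python
from collections import Counter

# Made-up per-shard results for one query; "doc7" exists in both shards.
shard_results = {
    "small": {"ids": ["doc7", "doc8"], "facets": Counter({"cat:book": 2})},
    "archive": {"ids": ["doc1", "doc7"], "facets": Counter({"cat:book": 1, "cat:dvd": 1})},
}

# Naive merge: add up what each shard reports.
summed_hits = sum(len(r["ids"]) for r in shard_results.values())
summed_facets = sum((r["facets"] for r in shard_results.values()), Counter())

# What the total would be if every document were counted once.
distinct_hits = len({doc_id for r in shard_results.values() for doc_id in r["ids"]})

print(summed_hits, distinct_hits)  # 4 3
print(summed_facets["cat:book"])   # 3, although only doc7 and doc8 are books
```

Dropping the duplicate from the merged document list can correct the hit
total, but the per-shard facet counts have already counted it twice.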




Re: Distributed Search

Posted by Mark Miller <ma...@gmail.com>.

'You are not supposed to have duplicates' is a bit strong - I was reading 
too much into something Yonik had mentioned in the past. It looks like 
it's supposed to become more useful:

I think Yonik might have to clear this up, but it looks like the current 
implementation is not deterministic, and he has it listed as a TODO:

            // make which duplicate is used deterministic based on shard
            // if (prevShard.compareTo(srsp.shard) >= 0) {
            //  TODO: remove previous from priority queue
            //  continue;
            // }
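
To make the idea behind that TODO concrete, here is a rough sketch of a
merge that picks the same duplicate regardless of arrival order. This is not
the Solr code: it is plain Python, and it assumes the tie-break is the
position of the shard in a given list, whereas the commented-out snippet
compares shard names. All names and data are hypothetical.

```python
def merge_deterministic(responses, shard_order):
    """Merge (shard, doc_id, score) rows. On a duplicate ID, keep the row from
    the shard that appears earliest in shard_order, whatever the arrival order."""
    rank = {shard: i for i, shard in enumerate(shard_order)}
    chosen = {}
    for shard, doc_id, score in responses:
        prev = chosen.get(doc_id)
        if prev is None or rank[shard] < rank[prev[0]]:
            chosen[doc_id] = (shard, score)
    return chosen


shard_order = ["small", "archive"]
rows = [("archive", "doc7", 1.0), ("small", "doc7", 1.2), ("archive", "doc1", 0.8)]

first = merge_deterministic(rows, shard_order)
second = merge_deterministic(list(reversed(rows)), shard_order)
print(first["doc7"])    # ('small', 1.2)
print(first == second)  # True
```

The sketch replaces an earlier pick in a dictionary, which is easy; the TODO
in the snippet points at the harder part, removing the previously chosen doc
from the priority queue.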


Mark Miller wrote:
> I don't think you're supposed to have duplicate keys? I think it's 
> supposed to work more as a graceful failure than a feature you should 
> count on. IDs should be unique across the collection.


-- 
- Mark

http://www.lucidimagination.com

Re: Distributed Search

Posted by Mark Miller <ma...@gmail.com>.

I don't think you're supposed to have duplicate keys? I think it's supposed 
to work more as a graceful failure than a feature you should count on. 
IDs should be unique across the collection.

>>  
> Ok, now I'm confused, if the shard the document comes from is 
> non-deterministic, how can you use this 'trick'? (except that since 
> the response time of the first shard which is smaller is usually 
> better which would mean it'll work most of time (BAD!)) Or was Koji's 
> memory incorrect and the shard first mentioned is always the 
> authoritative shard when encountering duplicate keys?
>
> Regards,
>
> gwk
>


-- 
- Mark

http://www.lucidimagination.com

Re: Distributed Search

Posted by gwk <gi...@eyefi.nl>.

Otis Gospodnetic wrote:
> Yes, that's the standard trick. :)
>
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
Ok, now I'm confused. If the shard the document comes from is 
non-deterministic, how can you use this 'trick'? (Other than relying on 
the first shard, being smaller, usually responding faster, which would 
mean it'll work most of the time (BAD!).) Or was Koji's memory incorrect, 
and the shard mentioned first is always the authoritative shard when 
duplicate keys are encountered?

Regards,

gwk

Re: Distributed Search

Posted by Otis Gospodnetic <ot...@yahoo.com>.

Yes, that's the standard trick. :)

 Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: gwk <gi...@eyefi.nl>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, February 25, 2009 5:18:47 AM
> Subject: Re: Distributed Search
> 
> Ok, so it wouldn't be possible to have a smaller, faster authoritative shard for 
> near-real-time updates while keeping the entire dataset in a second shard which 
> is updated less frequently?
> 
> Regards,
> 
> gwk

Re: Distributed Search

Posted by gwk <gi...@eyefi.nl>.

Koji Sekiguchi wrote:
> It is the doc of the shard which responds first, if my memory is 
> correct...
>
> Koji
>
>
Ok, so it wouldn't be possible to have a smaller, faster authoritative 
shard for near-real-time updates while keeping the entire dataset in a 
second shard which is updated less frequently?

Regards,

gwk

Re: Distributed Search

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.

gwk wrote:
> Hello,
>
> The wiki states 'When duplicate doc IDs are received, Solr chooses the 
> first doc and discards subsequent ones', I was wondering whether "the 
> first doc" is the doc of the shard which responds first or the doc in 
> the first shard in the shards GET parameter?
>
> Regards,
>
> gwk
>

It is the doc of the shard which responds first, if my memory is correct...

Koji
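
If that recollection is right, the merge behaves roughly like the sketch
below: whichever copy is seen first is kept. This is an illustration in
Python with invented shard names and IDs, not the actual Solr code.

```python
def merge_first_seen(responses):
    """Keep the first (shard, doc_id) row seen per doc ID; drop later duplicates."""
    chosen = {}
    for shard, doc_id in responses:
        chosen.setdefault(doc_id, shard)
    return chosen


# The same two shards answering in different orders (made-up data).
small_first = [("small", "doc7"), ("archive", "doc7")]
archive_first = [("archive", "doc7"), ("small", "doc7")]

print(merge_first_seen(small_first))    # {'doc7': 'small'}
print(merge_first_seen(archive_first))  # {'doc7': 'archive'}
```

The winning copy therefore depends on response timing rather than on
anything in the request.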