Posted to solr-user@lucene.apache.org by gwk <gi...@eyefi.nl> on 2009/02/23 14:29:18 UTC
Distributed Search
Hello,
The wiki states 'When duplicate doc IDs are received, Solr chooses the
first doc and discards subsequent ones'. I was wondering whether "the
first doc" is the doc from the shard which responds first or the doc from
the first shard listed in the shards GET parameter?
Regards,
gwk
Re: Distributed Search
Posted by Chris Hostetter <ho...@fucit.org>.
: > Yes, that's the standard trick. :)
: > > Ok, so it wouldn't be possible to have a smaller, faster authoritative
: > > shard for near-real-time updates while keeping the entire dataset in a
: > > second shard which is updated less frequently?
: Ok, now I'm confused, if the shard the document comes from is
: non-deterministic, how can you use this 'trick'? (except that since the
I believe Otis's point is that many people use distributed search across
shards where some are large and mostly static and one is small and
frequently updated with new docs, in order to get some performance
advantages out of the long cache lifetimes on the larger shard(s) ... but this
typically works best when you only "add" new docs, and don't modify old
ones (or only modify docs added very recently so they're always in the
small shard) while the bigger shards are treated as "archives" that don't
change.
To be deterministic you can't have the same uniqueKey in multiple shards.
-Hoss
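One way to check the condition Hoss describes is to scan the uniqueKey values held by each shard and report any key that turns up in more than one. The sketch below is only an illustration in plain Java: the class `DuplicateKeyCheck`, the method `findDuplicates`, and the idea of having each shard's IDs available as a list are assumptions made for the example, not part of Solr.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a Solr class.
class DuplicateKeyCheck {
    // Returns every uniqueKey found in more than one shard,
    // mapped to the names of the shards that hold it.
    static Map<String, List<String>> findDuplicates(Map<String, List<String>> idsByShard) {
        Map<String, List<String>> shardsById = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : idsByShard.entrySet()) {
            // De-duplicate within a shard so a key is counted once per shard.
            for (String id : new LinkedHashSet<>(entry.getValue())) {
                shardsById.computeIfAbsent(id, k -> new ArrayList<>()).add(entry.getKey());
            }
        }
        // Keep only the keys that appear in two or more shards.
        shardsById.values().removeIf(shards -> shards.size() < 2);
        return shardsById;
    }
}
```

With an "archive" shard holding IDs 1, 2, 3 and a "recent" shard holding 3, 4, the method reports ID 3 as present in both shards; an empty result means the shards are safe to query together.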
Re: Distributed Search
Posted by Mark Miller <ma...@gmail.com>.
Fair enough. We should update the Wiki then? I think it currently does
read as if it's a supported feature rather than something you should avoid.
--
- Mark
http://www.lucidimagination.com
Yonik Seeley wrote:
> On Wed, Feb 25, 2009 at 11:52 AM, Mark Miller <ma...@gmail.com> wrote:
>
>> "You are not supposed to have duplicates" is a bit strong - I was reading
>> too much into something Yonik had mentioned in the past. It looks like it's
>> supposed to become more useful:
>>
>
> Well, perhaps slightly more deterministic so that two queries return
> the same results.
> I think we should stick with the position that duplicate docs in
> shards is an error, but that we handle it gracefully w/o blowing up.
> Things like facet counts, paging, etc, will be slightly off.
>
> -Yonik
> Lucene/Solr? http://www.lucidimagination.com
>
>
>
>
>> I think Yonik might have to clear this up, but it looks like the current
>> implementation is not deterministic, and he has it listed as a TODO:
>>
>> // make which duplicate is used deterministic based on shard
>> // if (prevShard.compareTo(srsp.shard) >= 0) {
>> // TODO: remove previous from priority queue
>> // continue;
>> // }
>>
>>
>> Mark Miller wrote:
>>
>>> I don't think your supposed to have duplicate keys? I think its supposed
>>> to work more as a graceful failure than a feature you should count on. Id's
>>> should be unique across the collection.
>>>
>>>
>>>> Ok, now I'm confused, if the shard the document comes from is
>>>> non-deterministic, how can you use this 'trick'? (except that since the
>>>> response time of the first shard which is smaller is usually better which
>>>> would mean it'll work most of time (BAD!)) Or was Koji's memory incorrect
>>>> and the shard first mentioned is always the authoritative shard when
>>>> encountering duplicate keys?
>>>>
>>>> Regards,
>>>>
>>>> gwk
>>>>
>>>>
>>>
>> --
>> - Mark
>>
>> http://www.lucidimagination.com
>>
>>
>>
>>
>>
Re: Distributed Search
Posted by Yonik Seeley <ys...@gmail.com>.
On Wed, Feb 25, 2009 at 11:52 AM, Mark Miller <ma...@gmail.com> wrote:
> "You are not supposed to have duplicates" is a bit strong - I was reading
> too much into something Yonik had mentioned in the past. It looks like it's
> supposed to become more useful:
Well, perhaps slightly more deterministic so that two queries return
the same results.
I think we should stick with the position that having duplicate docs in
shards is an error, but that we handle it gracefully w/o blowing up.
Things like facet counts, paging, etc., will be slightly off.
-Yonik
Lucene/Solr? http://www.lucidimagination.com
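A small sketch of why counts drift when the same ID lives in more than one shard: each shard counts its own hits, so adding up per-shard totals counts a duplicated document once per shard. The class and method names here are made up for illustration, and real facet counting and paging in Solr involve more than this.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not a Solr class.
class DuplicateCountSketch {
    // Sum of per-shard hit counts: what you get if each shard
    // reports its own total and the totals are simply added.
    static int summedHits(Collection<List<String>> idsPerShard) {
        int total = 0;
        for (List<String> ids : idsPerShard) {
            total += ids.size();
        }
        return total;
    }

    // Number of distinct uniqueKeys across all shards: the count
    // a user would expect to see.
    static int distinctHits(Collection<List<String>> idsPerShard) {
        Set<String> seen = new HashSet<>();
        for (List<String> ids : idsPerShard) {
            seen.addAll(ids);
        }
        return seen.size();
    }
}
```

For shards holding IDs {1, 2, 3} and {3, 4}, `summedHits` gives 5 while `distinctHits` gives 4; the gap is the number of extra copies, which is the kind of small error described above.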
> I think Yonik might have to clear this up, but it looks like the current
> implementation is not deterministic, and he has it listed as a TODO:
>
> // make which duplicate is used deterministic based on shard
> // if (prevShard.compareTo(srsp.shard) >= 0) {
> // TODO: remove previous from priority queue
> // continue;
> // }
>
>
> Mark Miller wrote:
>>
>> I don't think your supposed to have duplicate keys? I think its supposed
>> to work more as a graceful failure than a feature you should count on. Id's
>> should be unique across the collection.
>>
>>>>
>>>
>>> Ok, now I'm confused, if the shard the document comes from is
>>> non-deterministic, how can you use this 'trick'? (except that since the
>>> response time of the first shard which is smaller is usually better which
>>> would mean it'll work most of time (BAD!)) Or was Koji's memory incorrect
>>> and the shard first mentioned is always the authoritative shard when
>>> encountering duplicate keys?
>>>
>>> Regards,
>>>
>>> gwk
>>>
>>
>>
>
>
> --
> - Mark
>
> http://www.lucidimagination.com
>
>
>
>
Re: Distributed Search
Posted by Mark Miller <ma...@gmail.com>.
"You are not supposed to have duplicates" is a bit strong - I was
reading too much into something Yonik had mentioned in the past. It looks
like it's supposed to become more useful:
I think Yonik might have to clear this up, but it looks like the current
implementation is not deterministic, and he has it listed as a TODO:
// make which duplicate is used deterministic based on shard
// if (prevShard.compareTo(srsp.shard) >= 0) {
//   TODO: remove previous from priority queue
//   continue;
// }
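To illustrate what a deterministic choice could look like, here is a minimal sketch that merges documents in whatever order they arrive and, on a duplicate ID, keeps the one from the shard whose name sorts lowest. This is not Solr's implementation: `ShardDoc`, `merge`, and the "lowest shard name wins" rule are hypothetical, loosely following the string comparison in the commented-out code above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Solr's merge code.
class DeterministicMergeSketch {
    // Stand-in for one document returned by one shard.
    static final class ShardDoc {
        final String id;
        final String shard;

        ShardDoc(String id, String shard) {
            this.id = id;
            this.shard = shard;
        }
    }

    // Processes docs in arrival order. When two docs share an id,
    // the one from the shard with the lexicographically lowest name
    // is kept, so the winner does not depend on response timing.
    static Map<String, ShardDoc> merge(List<ShardDoc> arrivals) {
        Map<String, ShardDoc> winners = new LinkedHashMap<>();
        for (ShardDoc doc : arrivals) {
            ShardDoc prev = winners.get(doc.id);
            if (prev == null || prev.shard.compareTo(doc.shard) > 0) {
                winners.put(doc.id, doc);
            }
        }
        return winners;
    }
}
```

Whether the doc from "shardA" or "shardB" arrives first, the doc from "shardA" is the one kept. Any fixed ordering of shards would do; name order is just the simplest to show.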
Mark Miller wrote:
> I don't think you're supposed to have duplicate keys? I think it's
> supposed to work more as a graceful failure than a feature you should
> count on. IDs should be unique across the collection.
>
>> Ok, now I'm confused, if the shard the document comes from is
>> non-deterministic, how can you use this 'trick'? (except that since
>> the response time of the first shard which is smaller is usually
>> better which would mean it'll work most of the time (BAD!)) Or was Koji's
>> memory incorrect and the shard first mentioned is always the
>> authoritative shard when encountering duplicate keys?
>>
>> Regards,
>>
>> gwk
>>
>
>
--
- Mark
http://www.lucidimagination.com
Re: Distributed Search
Posted by Mark Miller <ma...@gmail.com>.
I don't think you're supposed to have duplicate keys? I think it's supposed
to work more as a graceful failure than a feature you should count on.
IDs should be unique across the collection.
> Ok, now I'm confused, if the shard the document comes from is
> non-deterministic, how can you use this 'trick'? (except that since
> the response time of the first shard which is smaller is usually
> better which would mean it'll work most of the time (BAD!)) Or was Koji's
> memory incorrect and the shard first mentioned is always the
> authoritative shard when encountering duplicate keys?
>
> Regards,
>
> gwk
>
--
- Mark
http://www.lucidimagination.com
Re: Distributed Search
Posted by gwk <gi...@eyefi.nl>.
Otis Gospodnetic wrote:
> Yes, that's the standard trick. :)
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> ----- Original Message ----
>
>> From: gwk <gi...@eyefi.nl>
>> To: solr-user@lucene.apache.org
>> Sent: Wednesday, February 25, 2009 5:18:47 AM
>> Subject: Re: Distributed Search
>>
>> Koji Sekiguchi wrote:
>>
>>> gwk wrote:
>>>
>>>> Hello,
>>>>
>>>> The wiki states 'When duplicate doc IDs are received, Solr chooses the first
>>>>
>> doc and discards subsequent ones', I was wondering whether "the first doc" is
>> the doc of the shard which responds first or the doc in the first shard in the
>> shards GET parameter?
>>
>>>> Regards,
>>>>
>>>> gwk
>>>>
>>>>
>>> It is the doc of the shard which responds first, if my memory is correct...
>>>
>>> Koji
>>>
>>>
>>>
>> Ok, so it wouldn't be possible to have a smaller, faster authoritative shard for
>> near-real-time updates while keeping the entire dataset in a second shard which
>> is updates less frequently?
>>
>> Regards,
>>
>> gwk
>>
>
>
Ok, now I'm confused. If the shard the document comes from is
non-deterministic, how can you use this 'trick'? (Except that the
response time of the first shard, which is smaller, is usually better,
which would mean it'll work most of the time (BAD!).) Or was Koji's memory
incorrect, and the shard mentioned first is always the authoritative
shard when encountering duplicate keys?
Regards,
gwk
Re: Distributed Search
Posted by Otis Gospodnetic <ot...@yahoo.com>.
Yes, that's the standard trick. :)
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: gwk <gi...@eyefi.nl>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, February 25, 2009 5:18:47 AM
> Subject: Re: Distributed Search
>
> Koji Sekiguchi wrote:
> > gwk wrote:
> >> Hello,
> >>
> >> The wiki states 'When duplicate doc IDs are received, Solr chooses the first
> doc and discards subsequent ones', I was wondering whether "the first doc" is
> the doc of the shard which responds first or the doc in the first shard in the
> shards GET parameter?
> >>
> >> Regards,
> >>
> >> gwk
> >>
> >
> > It is the doc of the shard which responds first, if my memory is correct...
> >
> > Koji
> >
> >
> Ok, so it wouldn't be possible to have a smaller, faster authoritative shard for
> near-real-time updates while keeping the entire dataset in a second shard which
> is updates less frequently?
>
> Regards,
>
> gwk
Re: Distributed Search
Posted by gwk <gi...@eyefi.nl>.
Koji Sekiguchi wrote:
> gwk wrote:
>> Hello,
>>
>> The wiki states 'When duplicate doc IDs are received, Solr chooses
>> the first doc and discards subsequent ones', I was wondering whether
>> "the first doc" is the doc of the shard which responds first or the
>> doc in the first shard in the shards GET parameter?
>>
>> Regards,
>>
>> gwk
>>
>
> It is the doc of the shard which responds first, if my memory is
> correct...
>
> Koji
>
>
Ok, so it wouldn't be possible to have a smaller, faster authoritative
shard for near-real-time updates while keeping the entire dataset in a
second shard which is updated less frequently?
Regards,
gwk
Re: Distributed Search
Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
gwk wrote:
> Hello,
>
> The wiki states 'When duplicate doc IDs are received, Solr chooses the
> first doc and discards subsequent ones', I was wondering whether "the
> first doc" is the doc of the shard which responds first or the doc in
> the first shard in the shards GET parameter?
>
> Regards,
>
> gwk
>
It is the doc of the shard which responds first, if my memory is correct...
Koji