Posted to solr-user@lucene.apache.org by Luis Cappa Banda <lu...@gmail.com> on 2013/05/23 09:51:00 UTC

Distributed query: strange behavior.

Hello, guys!

I'm running Solr 4.3.0 and I've noticed a strange behavior during
distributed query execution. Currently I have three Solr servers as
shards, and when I run the following query...


http://localhost:11080/twitter/data/select?&q=*:*&rows=10&shards=localhost:11080/twitter/data,localhost:12080/twitter/data,localhost:13080/twitter/data&wt=json

*Numfound* = 47131


I've queried each Solr shard server one by one and the total number of
documents is correct. However, when I change the rows parameter from 10 to
100 the total numFound changes:

http://localhost:11080/twitter/data/select?&q=*:*&rows=100&shards=localhost:11080/twitter/data,localhost:12080/twitter/data,localhost:13080/twitter/data&wt=json

*Numfound* = 47124

And if I set rows=50, the numFound count changes again:

http://localhost:11080/twitter/data/select?&q=*:*&rows=50&shards=localhost:11080/twitter/data,localhost:12080/twitter/data,localhost:13080/twitter/data&wt=json

*Numfound* = 47129


What's happening here? Does anybody know? Is it a distributed search bug or
something?

Thank you very much in advance!


Best regards,

-- 
- Luis Cappa

Re: Distributed query: strange behavior.

Posted by Luis Cappa Banda <lu...@gmail.com>.

Uhm... that sounds reasonable. My data model may allow duplicate keys, but
it's quite unlikely. My key is a hash derived from a URL during a crawling
process, and it's possible to re-crawl an existing URL. I think that I need
to find a new way to compose a unique key to avoid this kind of bad
behavior. However, it would be very useful if Solr could alert about
duplicate keys or something. Maybe an extra field included in the response
alongside numFound, docs, facets, etc. would be nice. Thank you
very much!
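
In the meantime, a check like this could be done on the client side. Below is
a minimal sketch in Python, assuming the key values stored in each shard have
already been fetched separately (how they are fetched is left out). The
function name and the sample shard names are made up for illustration.

```python
from collections import defaultdict


def find_cross_shard_duplicates(ids_by_shard):
    """Return {key: sorted shard names} for keys found in more than one shard.

    ids_by_shard maps a (hypothetical) shard name to the key values
    that were read from that shard.
    """
    shards_for_key = defaultdict(set)
    for shard, ids in ids_by_shard.items():
        for doc_id in ids:
            shards_for_key[doc_id].add(shard)
    # Keep only the keys that live in two or more shards.
    return {key: sorted(shards)
            for key, shards in shards_for_key.items()
            if len(shards) > 1}


sample = {"shard1": ["a", "b"], "shard2": ["b", "c"], "shard3": ["d"]}
duplicates = find_cross_shard_duplicates(sample)
```

Any key reported here is a candidate explanation for a numFound that shifts
with the rows parameter.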

Best regards,

- Luis Cappa


2013/5/23 Shawn Heisey <so...@elyograg.org>

> On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
> > I've queried each Solr shard server one by one and the total number of
> > documents is correct. However, when I change the rows parameter from 10
> > to 100 the total numFound changes:
>
> I've seen this problem on the list before, and each time the cause has
> been determined to be documents with the same uniqueKey value appearing
> in more than one shard.
>
> What I think happens here:
>
> With rows=10, you get the top ten docs from each of the three shards,
> and each shard sends its numFound for that query to the core that's
> coordinating the search.  The coordinator adds up numFound, looks
> through those thirty docs, and arranges them according to the requested
> sort order, returning only the top 10.  In this case, there happen to be
> no duplicates.
>
> With rows=100, you get a total of 300 docs.  This time, duplicates are
> found and removed by the coordinator.  I think that the coordinator
> adjusts the total numFound by the number of duplicate documents it
> removed, in an attempt to be more accurate.
>
> I don't know if adjusting numFound when duplicates are found in a
> sharded query is the right thing to do; I'll leave that for smarter
> people.  Perhaps Solr should return a message with the results saying
> that duplicates were found, and if a config option is not enabled, the
> server should throw an exception and return a 4xx HTTP error code.  One
> idea for a config parameter name would be allowShardDuplicates, but
> something better can probably be found.
>
> Thanks,
> Shawn
>
>


-- 
- Luis Cappa

Re: Distributed query: strange behavior.

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

The uniqueKey is enforced within the same shard/index only.
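
A toy illustration in Python, with plain dicts standing in for two
independent indexes (this is not Solr code, and the names are made up):

```python
# Each shard is modelled as its own dict keyed by the uniqueKey value.
shard_a, shard_b = {}, {}


def add(shard, doc):
    # Within one shard, adding a doc with an existing key overwrites it.
    shard[doc["id"]] = doc


add(shard_a, {"id": "url-1", "v": 1})
add(shard_a, {"id": "url-1", "v": 2})  # replaces the previous doc in shard_a
add(shard_b, {"id": "url-1", "v": 3})  # shard_b knows nothing about shard_a

# Two stored documents overall, but only one distinct key.
total_docs = len(shard_a) + len(shard_b)
distinct_keys = len(set(shard_a) | set(shard_b))
```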


On Fri, May 24, 2013 at 6:39 PM, Valery Giner <va...@research.att.com> wrote:

> Shawn,
>
> How is it possible for more than one document with the same unique key to
> appear in the index, even in different shards?
> Isn't it a bug by definition?
> What am I missing here?
>
> Thanks,
> Val
>
>


-- 
Regards,
Shalin Shekhar Mangar.

Re: Distributed query: strange behavior.

Posted by Luis Cappa Banda <lu...@gmail.com>.

Hello, guys!

Well, I've done some tests and I think that there is some kind of bug
related to distributed search. I'm now setting a key field that cannot
possibly be duplicated, and I still see the same wrong behavior in the
numFound field when changing the rows parameter. Has anyone experienced
the same?

Best regards,

- Luis Cappa


2013/5/27 Luis Cappa Banda <lu...@gmail.com>

> Hi, Erick!
>
> That's it! I'm using a custom implementation of a SolrServer with
> distributed behavior that routes queries and updates using an in-house
> Round Robin method. But the thing is that I'm doing this myself because
> I've noticed that duplicated documents appear when using the
> LBHttpSolrServer implementation. Last week I modified my implementation to
> avoid that with these changes:
>
>
>    - I have normalized the key field for all documents. Now every indexed
>    document must include an *_id_* field that stores the selected key
>    value. The value is set with a *copyField*.
>    - When I index a new document, an *HttpSolrServer* from the shard list
>    is selected using a Round Robin strategy. Then a field called *_shard_*
>    is set on the *SolrInputDocument*. That field's value identifies the
>    shard that was selected.
>    - If a document is to be indexed/updated and it includes the *_shard_*
>    field, the shard it belongs to (*HttpSolrServer*) is selected
>    automatically.
>    - If a document is to be indexed/updated and the *_shard_* field is not
>    included, the key value is read from the *_id_* field of the
>    *SolrInputDocument*. A distributed search query by that key is then
>    executed to retrieve the *_shard_* field. With the *_shard_* field we
>    can now choose the correct shard (*HttpSolrServer*). It's not good
>    practice and performance isn't the best, but it's safe.
>
> Best Regards,
>
> - Luis Cappa



-- 
- Luis Cappa

Re: Distributed query: strange behavior.

Posted by Luis Cappa Banda <lu...@gmail.com>.

Hi, Erick!

That's it! I'm using a custom implementation of a SolrServer with
distributed behavior that routes queries and updates using an in-house
Round Robin method. But the thing is that I'm doing this myself because
I've noticed that duplicated documents appear when using the
LBHttpSolrServer implementation. Last week I modified my implementation to
avoid that with these changes:


   - I have normalized the key field for all documents. Now every indexed
   document must include an *_id_* field that stores the selected key value.
   The value is set with a *copyField*.
   - When I index a new document, an *HttpSolrServer* from the shard list is
   selected using a Round Robin strategy. Then a field called *_shard_* is
   set on the *SolrInputDocument*. That field's value identifies the shard
   that was selected.
   - If a document is to be indexed/updated and it includes the *_shard_*
   field, the shard it belongs to (*HttpSolrServer*) is selected
   automatically.
   - If a document is to be indexed/updated and the *_shard_* field is not
   included, the key value is read from the *_id_* field of the
   *SolrInputDocument*. A distributed search query by that key is then
   executed to retrieve the *_shard_* field. With the *_shard_* field we can
   now choose the correct shard (*HttpSolrServer*). It's not good practice
   and performance isn't the best, but it's safe.
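
The selection logic described in the list can be sketched roughly as follows.
This is a simplified Python model, not my actual SolrJ code: the class name is
made up, shards are plain strings, and `lookup_shard` is a hypothetical
callback standing in for the distributed query that retrieves the stored
*_shard_* value for a key.

```python
import itertools


class StickyRoundRobinRouter:
    """Pick a shard for a document, reusing its previous shard if known."""

    def __init__(self, shards, lookup_shard):
        self._cycle = itertools.cycle(shards)
        # lookup_shard(key) returns the shard that already holds the key,
        # or None if the key has never been indexed (assumed callback).
        self._lookup_shard = lookup_shard

    def route(self, doc):
        # 1. Trust an explicit _shard_ field if the document carries one.
        shard = doc.get("_shard_")
        # 2. Otherwise ask where this key was indexed before.
        if shard is None:
            shard = self._lookup_shard(doc["_id_"])
        # 3. Only a genuinely new key gets the next round-robin shard.
        if shard is None:
            shard = next(self._cycle)
        doc["_shard_"] = shard
        return shard


index = {}  # stands in for what the shards already contain: key -> shard
router = StickyRoundRobinRouter(["s1", "s2", "s3"], index.get)

first = {"_id_": "k1"}
index["k1"] = router.route(first)      # new key, first shard in the cycle
second = {"_id_": "k2"}
index["k2"] = router.route(second)     # new key, next shard in the cycle
again = {"_id_": "k1"}
same_shard = router.route(again)       # known key, goes back to its shard
```

The lookup in step 2 is the expensive part, since it costs one extra query
per update that arrives without a *_shard_* field.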

Best Regards,

- Luis Cappa


2013/5/26 Erick Erickson <er...@gmail.com>

> Valery:
>
> I share your puzzlement. _If_ you are letting Solr do the document
> routing, and not doing any of the custom routing, then the same unique
> key should be going to the same shard and replacing the previous doc
> with that key.
>
> But if you're using custom routing, if you've been experimenting with
> different configurations and didn't start over, or in general if your
> configuration is in an "interesting" state, this could happen.
>
> So in the normal case if you have a document with the same key indexed
> in multiple shards, that would indicate a bug. But there are many
> ways, especially when experimenting, that you could have this happen
> which are _not_ a bug. I'm guessing that Luis may be trying the custom
> routing option maybe?
>
> Best
> Erick
>



-- 
- Luis Cappa

Re: Distributed query: strange behavior.

Posted by Valery Giner <va...@research.att.com>.

Erick,

Thank you for the explanation.

My concern was that allowing docs with the same unique id to be
present in multiple shards in a "normal" situation
makes it impossible to estimate the number of shards needed for an index
with a "really large" number of docs.

Thanks,
Val

On 05/26/2013 11:16 AM, Erick Erickson wrote:
> Valery:
>
> I share your puzzlement. _If_ you are letting Solr do the document
> routing, and not doing any of the custom routing, then the same unique
> key should be going to the same shard and replacing the previous doc
> with that key.
>
> But if you're using custom routing, if you've been experimenting with
> different configurations and didn't start over, or in general if your
> configuration is in an "interesting" state, this could happen.
>
> So in the normal case if you have a document with the same key indexed
> in multiple shards, that would indicate a bug. But there are many
> ways, especially when experimenting, that you could have this happen
> which are _not_ a bug. I'm guessing that Luis may be trying the custom
> routing option maybe?
>
> Best
> Erick

Re: Distributed query: strange behavior.

Posted by Erick Erickson <er...@gmail.com>.

Valery:

I share your puzzlement. _If_ you are letting Solr do the document
routing, and not doing any of the custom routing, then the same unique
key should be going to the same shard and replacing the previous doc
with that key.
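
As a rough sketch of why that works (this is not Solr's actual router; the
hash function, the function name and the shard list are assumptions chosen
for illustration): when the target shard is a pure function of the key,
re-indexing the same key always lands on the same shard, which a round-robin
choice cannot guarantee.

```python
import zlib


def pick_shard(doc_id, shards):
    """Deterministically map a unique key to one shard.

    zlib.crc32 is only a stand-in hash for this sketch.
    """
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]


shards = [
    "localhost:11080/twitter/data",
    "localhost:12080/twitter/data",
    "localhost:13080/twitter/data",
]

# The same key maps to the same shard on every call.
first_target = pick_shard("http://example.com/page", shards)
second_target = pick_shard("http://example.com/page", shards)
```

Note that the mapping changes if the shard list changes, which is one way to
end up in the "interesting" state described below.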

But if you're using custom routing, if you've been experimenting with
different configurations and didn't start over, or in general if your
configuration is in an "interesting" state, this could happen.

So in the normal case if you have a document with the same key indexed
in multiple shards, that would indicate a bug. But there are many
ways, especially when experimenting, that you could have this happen
which are _not_ a bug. I'm guessing that Luis may be trying the custom
routing option maybe?

Best
Erick

On Fri, May 24, 2013 at 9:09 AM, Valery Giner <va...@research.att.com> wrote:
> Shawn,
>
> How is it possible for more than one document with the same unique key to
> appear in the index, even in different shards?
> Isn't it a bug by definition?
> What am I missing here?
>
> Thanks,
> Val
>

Re: Distributed query: strange behavior.

Posted by Valery Giner <va...@research.att.com>.

Shawn,

How is it possible for more than one document with the same unique key 
to appear in the index, even in different shards?
Isn't it a bug by definition?
What am I missing here?

Thanks,
Val

On 05/23/2013 09:55 AM, Shawn Heisey wrote:
> On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
>> I've queried each Solr shard server one by one and the total number of
>> documents is correct. However, when I change the rows parameter from 10 to 100
>> the total numFound changes:
> I've seen this problem on the list before, and each time the cause has
> been determined to be documents with the same uniqueKey value appearing
> in more than one shard.
>
> What I think happens here:
>
> With rows=10, you get the top ten docs from each of the three shards,
> and each shard sends its numFound for that query to the core that's
> coordinating the search.  The coordinator adds up numFound, looks
> through those thirty docs, and arranges them according to the requested
> sort order, returning only the top 10.  In this case, there happen to be
> no duplicates.
>
> With rows=100, you get a total of 300 docs.  This time, duplicates are
> found and removed by the coordinator.  I think that the coordinator
> adjusts the total numFound by the number of duplicate documents it
> removed, in an attempt to be more accurate.
>
> I don't know if adjusting numFound when duplicates are found in a
> sharded query is the right thing to do; I'll leave that for smarter
> people.  Perhaps Solr should return a message with the results saying
> that duplicates were found, and if a config option is not enabled, the
> server should throw an exception and return a 4xx HTTP error code.  One
> idea for a config parameter name would be allowShardDuplicates, but
> something better can probably be found.
>
> Thanks,
> Shawn
>

Re: Distributed query: strange behavior.

Posted by Shawn Heisey <so...@elyograg.org>.

On 5/23/2013 1:51 AM, Luis Cappa Banda wrote:
> I've queried each Solr shard server one by one and the total number of
> documents is correct. However, when I change the rows parameter from 10 to 100
> the total numFound changes:

I've seen this problem on the list before, and each time the cause has
been determined to be documents with the same uniqueKey value appearing
in more than one shard.

What I think happens here:

With rows=10, you get the top ten docs from each of the three shards,
and each shard sends its numFound for that query to the core that's
coordinating the search.  The coordinator adds up numFound, looks
through those thirty docs, and arranges them according to the requested
sort order, returning only the top 10.  In this case, there happen to be
no duplicates.

With rows=100, you get a total of 300 docs.  This time, duplicates are
found and removed by the coordinator.  I think that the coordinator
adjusts the total numFound by the number of duplicate documents it
removed, in an attempt to be more accurate.
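
To make that guess concrete, here is a minimal Python sketch of a coordinator
that sums the per-shard numFound values and subtracts one for each duplicate
uniqueKey it notices among the rows it actually receives. It is a toy model
of the behavior I'm describing, not Solr's merge code; the function name and
data layout are made up, and cross-shard sorting is left out.

```python
def merge_shard_responses(shard_docs, rows):
    """Toy coordinator.

    shard_docs is a list of per-shard lists of uniqueKey values, each
    already in sort order. Returns (num_found, merged_ids).
    """
    # Each shard reports its full match count, regardless of rows.
    num_found = sum(len(docs) for docs in shard_docs)
    seen = set()
    merged = []
    for docs in shard_docs:
        # The coordinator only sees the top `rows` ids from each shard.
        for doc_id in docs[:rows]:
            if doc_id in seen:
                # Duplicate key from another shard: drop it, adjust the total.
                num_found -= 1
            else:
                seen.add(doc_id)
                merged.append(doc_id)
    return num_found, merged[:rows]


shards = [
    ["a", "b", "c", "dup"],
    ["d", "e", "f", "dup"],
    ["g", "h", "i"],
]
small_total, _ = merge_shard_responses(shards, rows=2)  # duplicate not seen: 11
large_total, _ = merge_shard_responses(shards, rows=4)  # duplicate seen once: 10
```

In this model the reported total depends on how deep into each shard's
results the coordinator looks, which matches numFound moving as rows changes.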

I don't know if adjusting numFound when duplicates are found in a
sharded query is the right thing to do; I'll leave that for smarter
people.  Perhaps Solr should return a message with the results saying
that duplicates were found, and if a config option is not enabled, the
server should throw an exception and return a 4xx HTTP error code.  One
idea for a config parameter name would be allowShardDuplicates, but
something better can probably be found.

Thanks,
Shawn