Posted to solr-user@lucene.apache.org by deniz <de...@gmail.com> on 2012/11/30 04:49:48 UTC

SolrCloud - Sorting Problem

Hello, I am having a weird problem with SolrCloud and sorting. I will open a
bug ticket about this too, but I am wondering if anyone has had similar
problems.

Background: Basically, I have added a new feature to Solr after I got the
source code. Similar to the way we get "score" in the result set, I am now
able to get the position (or ranking) of each document in the list, i.e.
if there are 5 documents in the result set, each of them has its position
information if you add "fl=*,position" to the query.

Problem: Briefly, when a Solr instance is standalone, there is no problem
with sorting or with the position information of each document, but when the
same Solr is on a cloud (as a master), the result set is kind of shuffled and
the position information is incorrect.

So it is like this:

Both the standalone instance and the one on the cloud find the same number of
documents in the index (say 15000), which is filled from the same data
source. So up to this point everything seems normal.

But here are the results:

Standalone Solr:

<doc>
    <id>a</id>
    <position>1</position>
</doc>
<doc>
    <id>b</id>
    <position>2</position>
</doc>
<doc>
    <id>c</id>
    <position>3</position>
</doc>
<doc>
    <id>d</id>
    <position>4</position>
</doc>
<doc>
    <id>e</id>
    <position>5</position>
</doc>
<doc>
    <id>f</id>
    <position>6</position>
</doc>

Same Solr on the cloud (as master):

<doc>
    <id>z</id>
    <position>4</position>
</doc>
<doc>
    <id>x</id>
    <position>6</position>
</doc>
<doc>
    <id>y</id>
    <position>1</position>
</doc>
<doc>
    <id>v</id>
    <position>3</position>
</doc>
<doc>
    <id>r</id>
    <position>2</position>
</doc>
<doc>
    <id>o</id>
    <position>5</position>
</doc>


As is clear above, the *same configs with the same query and sorting
parameter* are returning *different documents and totally shuffled
position* information.


Does anyone have any ideas on this?





-----
Smart, but does not study... Would do it if he studied...
--
View this message in context: http://lucene.472066.n3.nabble.com/SolrCloud-Sorting-Problem-tp4023382.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: SolrCloud - Sorting Problem

Posted by varun srivastava <va...@gmail.com>.
Also, if anyone who understands distributed search could update the
following wiki page, it would be really helpful for all of us.

http://wiki.apache.org/solr/DistributedSearchDesign

Thanks
Varun

On Sat, Mar 9, 2013 at 4:03 PM, varun srivastava <va...@gmail.com> wrote:

> Hi Deniz,
>  Your mail about distributed query is really helpful. Can you or someone
> else improve the following wiki page? Right now we don't have any document
> explaining distributed search in Solr, which is now the backbone of SolrCloud.
>
> http://wiki.apache.org/solr/WritingDistributedSearchComponents
>
> Thanks
> Varun

Re: SolrCloud - Sorting Problem

Posted by varun srivastava <va...@gmail.com>.
Hi Deniz,
 Your mail about distributed query is really helpful. Can you or someone
else improve the following wiki page? Right now we don't have any document
explaining distributed search in Solr, which is now the backbone of SolrCloud.

http://wiki.apache.org/solr/WritingDistributedSearchComponents

Thanks
Varun

On Sun, Dec 2, 2012 at 10:49 PM, deniz <de...@gmail.com> wrote:

> I think I have figured out this... at least some kinda..

Re: SolrCloud - Sorting Problem

Posted by deniz <de...@gmail.com>.
I think I have figured this out... at least kind of.

After putting logs here and there in the code, especially in the SolrCore,
HttpShardHandler and SearchHandler classes, it seems like the sorting is done
after all of the shards finish "responding", so the result set is sorted just
before we see the results... I am not sure if this is totally correct; it is
what I see from the logs, in the request headers.

So for a shard or distributed search the header looks like this:

status=0,QTime=4,params={df=text,fl=*,position,shard.url=blablabla

and just before I see the results in my browser the header becomes this:

status=0,QTime=178,params={fl=*,position,sort=myfield desc

Basically, because the position field was filled in before the actual
sorting of the page, the positions are incorrect...

Is this right? I mean, is sorting really done after everything finishes and
we are about to get the results?
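
To make the effect concrete, here is a minimal Python sketch. It is not Solr
code: the two shards, their contents and the helper names are made up for
illustration, and only the field name myfield comes from the log line above.
Each shard numbers its own documents before the merge, the aggregator then
sorts the combined list, and the positions come out wrong. Numbering after
the merge gives the expected 1..N.

```python
# Illustrative only: two hypothetical shards, each holding (id, myfield) pairs.
shard_1 = [("a", 90), ("c", 70), ("e", 50)]
shard_2 = [("b", 80), ("d", 60), ("f", 40)]


def number_on_shard(docs):
    """Sort one shard's docs by myfield desc and attach a shard-local position."""
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    return [{"id": doc_id, "myfield": value, "position": i + 1}
            for i, (doc_id, value) in enumerate(ranked)]


def merge(shard_results):
    """Aggregator step: combine all shard results and sort by myfield desc."""
    combined = [doc for result in shard_results for doc in result]
    return sorted(combined, key=lambda d: d["myfield"], reverse=True)


def renumber(merged_docs):
    """Assign positions only after the merge has fixed the final order."""
    return [dict(doc, position=i + 1) for i, doc in enumerate(merged_docs)]


merged = merge([number_on_shard(shard_1), number_on_shard(shard_2)])
early_positions = [doc["position"] for doc in merged]            # filled before the final sort
late_positions = [doc["position"] for doc in renumber(merged)]   # filled after it
print(early_positions)
print(late_positions)
```

With this toy data the early numbering repeats positions across shards, while
the late numbering is a clean sequence.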




Re: SolrCloud - Sorting Problem

Posted by deniz <de...@gmail.com>.
deniz wrote
> after these, I guess i need to check how the request is distributed on
> cloud... any ideas where I should start checking?

To answer my own question (hopefully correctly): I have started digging into
org.apache.solr.handler.component.SearchHandler.handleRequestBody, which
loops (I couldn't find out exactly why or how, but it is always 3 times)
and calls my custom method inside the loop.




Re: SolrCloud - Sorting Problem

Posted by deniz <de...@gmail.com>.
Chris Hostetter-3 wrote
> w/o more information about how/where you add this information, it's going 
> to be really hard to give you suggestions on how to fix your problem.

The modifications I made are nearly the same as for the score field.
Basically, I have added a PositionAugmenter class, modified the ReturnFields
class and made some changes to the classes which extend the Document Iterator
class, to show the position of each document. It is pretty simple actually:
when you make a query you see some results, sorted by whichever field you
choose, and depending on how you view the result page, there is position
information for each document, added by my modifications.


And thank you for your explanation. As I see it, this works in pretty much
the same way as traditional sharding... but the point which makes me totally
confused is that even if there is a single Solr instance in the cloud, the
order of documents is different and the position information is not correct.

So when you make a search on a standalone single Solr, you see some
documents sorted in some order. And when you make the same search, with the
same dataset and index, in a cloud which has a single Solr inside, it returns
documents in a different order. So basically, even without asking for
position information, the order is different between a standalone instance
and an instance on the cloud.

Besides this, I have made some simple tests to see what was going on. On a
standalone Solr, when I make a query and also add position to fl, my
modifications are called only once, and then I see the results. However, in
the cloud, with a single instance, when I run the same query, the same part
is called more than once, usually 3 times (I don't know why).

And when there are more instances in the cloud, I can see the same logs in
both instances, though the number of times that I see the logs differs for
each request, which is normal for a cloud with multiple Solrs running in it...

After these, I guess I need to check how the request is distributed in the
cloud... any ideas where I should start checking?




Re: SolrCloud - Sorting Problem

Posted by Chris Hostetter <ho...@fucit.org>.
: Background: Basically, I have added a new feature to Solr after I got the
: source code. Similar to the we get "score" in the resultset,  I am now able
: to get position (or ranking) information of each document in the list. i.e
: if there are 5 documents in the result set, each of them has its position
: information if you add "fl=*,position" to the query.

w/o more information about how/where you add this information, it's going 
to be really hard to give you suggestions on how to fix your problem.

: solr is on a cloud (as a master), the result set is some kinda shuffled and
: position information is incorrect.

In general the thing to keep in mind is that when doing a distributed 
query, each node is responsible for providing data about the results, and 
then a single node (whichever one your client/browser is connected to) 
acts as an aggregator to merge that information.

How that merging happens is specific to the information, and in most cases 
multiple (pipelined) requests are made to the individual shards.

for example: a request to search for X, sorted by Y, and faceting on field 
Z requires two requests to every shard: the first request gets the 
docIds and value of "Y" for the first N docs from each shard, as well as 
the top ranking facet values and their counts for field Z.  Then the 
aggregator looks at the Y values to figure out which docs from which shards 
should be in the final result, and it looks at the facet values to see 
which ones should be in the final result, and then it issues a second 
request to each shard in which it asks for the "fl" of those specific 
docs, and the final counts for those specific facet values, and 
then the final response is built up and returned to the client.
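
The two round trips described above can be sketched roughly as follows. This
is a simplified Python illustration, not Solr's code: the shard data, the
field names y and title, and the functions phase_one, phase_two and
distributed_search are all made up for the example, and faceting is left out.

```python
import heapq

# Hypothetical shard contents: doc id -> stored fields, with "y" as the sort field.
SHARDS = {
    "shard1": {"a": {"y": 9, "title": "A"},
               "c": {"y": 5, "title": "C"},
               "e": {"y": 1, "title": "E"}},
    "shard2": {"b": {"y": 7, "title": "B"},
               "d": {"y": 3, "title": "D"}},
}


def phase_one(shard, n):
    """Shard side, first request: ids and sort values of the shard's top n docs."""
    docs = SHARDS[shard]
    top = heapq.nlargest(n, docs.items(), key=lambda item: item[1]["y"])
    return [(fields["y"], doc_id, shard) for doc_id, fields in top]


def phase_two(shard, doc_ids):
    """Shard side, second request: stored fields for the requested ids only."""
    return {doc_id: SHARDS[shard][doc_id] for doc_id in doc_ids}


def distributed_search(n):
    """Aggregator: merge phase-one results, then fetch fields for the winners."""
    candidates = [hit for shard in SHARDS for hit in phase_one(shard, n)]
    winners = heapq.nlargest(n, candidates)  # highest y first
    wanted = {}
    for _, doc_id, shard in winners:
        wanted.setdefault(shard, []).append(doc_id)
    fetched = {}
    for shard, doc_ids in wanted.items():
        fetched.update(phase_two(shard, doc_ids))
    # The final order is decided here, on the aggregator, not on any shard.
    return [dict(fetched[doc_id], id=doc_id) for _, doc_id, _ in winners]


results = distributed_search(3)
print([doc["id"] for doc in results])
```

Note that in this sketch no single shard ever sees the final order: only the
aggregator does, after the merge.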

so depending on what you really mean in terms of "position" in an 
aggregated request like this, you need to make sure your custom code is 
running in the right place -- that may mean having logic that runs on the 
individual shards, as well as merge logic on the aggregator, or it may 
mean logic that *only* runs on the aggregator, based on information 
already available.

the details of what you are trying to do, and how you are currently 
attempting to do it, matter a lot.


-Hoss

Re: SolrCloud - Sorting Problem

Posted by deniz <de...@gmail.com>.
After playing with this more, I think I have a clue...

On the standalone Solr, when I give start 11 and rows 20, I can see
documents with positions ranging from 12 to 31, which is correct... On the
cloud, when I give the same parameters, I again get the same documents, but
this time the positions range from 1 to 20...

So my question: does the cloud use some different class for responding to
the search request? If so, are there any other ways to find those classes
other than digging through the code?
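
For illustration, the two ranges above differ exactly by the start offset. A
minimal Python sketch, assuming start is 0-based and that the code filling in
the position only sees the rows it was handed (both function names are
hypothetical, not part of Solr):

```python
def positions_without_offset(rows):
    """Numbering by index within the returned page only."""
    return [i + 1 for i in range(rows)]


def positions_with_offset(start, rows):
    """Numbering within the overall result list: 0-based start plus 1-based index."""
    return [start + i + 1 for i in range(rows)]


start, rows = 11, 20
local = positions_without_offset(rows)
overall = positions_with_offset(start, rows)
print(local[0], local[-1])      # first and last position without the offset
print(overall[0], overall[-1])  # first and last position with the offset
```

If that assumption holds, the cloud case would behave like the first
function and the standalone case like the second.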


