Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2016/07/19 23:22:20 UTC

[jira] [Updated] (SOLR-6810) Faster searching limited but high rows across many shards all with many hits

     [ https://issues.apache.org/jira/browse/SOLR-6810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erick Erickson updated SOLR-6810:
---------------------------------
    Attachment: SOLR-6810-hack-eoe.patch

SOLR-8220 does NOT resolve this, but I think it lays the groundwork for a much smaller implementation.

I've attached a patch that is a PoC. Note that there are //nocommits where I write to System.out from CompressingStoredFieldsReader, just for easy verification of whether we're decompressing or not.

Also see the nocommit in DocsStreamer. To make this work you need to define your id field as stored=false, docValues=true. I don't think I understand useDocValuesAsStored yet: setting stored=true and useDocValuesAsStored=true still fetches the stored field, so I'll have to figure that out.
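
For reference, the kind of field definition I mean is something like this (just a sketch using the stock string type; adjust the other attributes to your schema):

{code:xml}
<!-- id is resolved from doc values only, so there is no stored value to decompress -->
<field name="id" type="string" indexed="true" stored="false" docValues="true" required="true" multiValued="false"/>
{code}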

I'm sure this isn't an optimal implementation, but maybe it'll prompt some more carefully thought-out approaches.
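
To make the intent concrete: the core idea is to resolve each hit's id from doc values instead of going through the stored-fields reader, so the compressed stored-field blocks are never touched for the id-only phase. A rough sketch in plain Lucene terms, not the actual patch code, assuming the pre-7.0 doc-values API and a single-valued string id field (the class and method names here are mine):

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.ReaderUtil;
import org.apache.lucene.index.SortedDocValues;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.util.BytesRef;

public class DocValuesIdResolver {
  // Resolve the id for each top hit from doc values rather than stored fields,
  // so no stored-field block has to be decompressed just to return ids.
  public static String[] idsFromDocValues(IndexSearcher searcher, ScoreDoc[] hits)
      throws IOException {
    List<LeafReaderContext> leaves = searcher.getIndexReader().leaves();
    String[] ids = new String[hits.length];
    for (int i = 0; i < hits.length; i++) {
      // Find the segment that contains this top-level doc id.
      int idx = ReaderUtil.subIndex(hits[i].doc, leaves);
      LeafReaderContext leaf = leaves.get(idx);
      // Columnar read of the id; a real implementation would sort hits by doc
      // and reuse the per-segment SortedDocValues instance.
      SortedDocValues dv = DocValues.getSorted(leaf.reader(), "id");
      BytesRef id = dv.get(hits[i].doc - leaf.docBase);
      ids[i] = id.utf8ToString();
    }
    return ids;
  }
}
{code}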

Mostly putting this up for comment; I'm probably not going to pursue this in the near future.

> Faster searching limited but high rows across many shards all with many hits
> ----------------------------------------------------------------------------
>
>                 Key: SOLR-6810
>                 URL: https://issues.apache.org/jira/browse/SOLR-6810
>             Project: Solr
>          Issue Type: Improvement
>          Components: search
>            Reporter: Per Steffensen
>            Assignee: Shalin Shekhar Mangar
>              Labels: distributed_search, performance
>         Attachments: SOLR-6810-hack-eoe.patch, SOLR-6810-trunk.patch, SOLR-6810-trunk.patch, SOLR-6810-trunk.patch, branch_5x_rev1642874.patch, branch_5x_rev1642874.patch, branch_5x_rev1645549.patch
>
>
> Searching "limited but high rows across many shards all with many hits" is slow
> E.g.
> * Query from outside client: q=something&rows=1000
> * Resulting in sub-requests to each shard along these lines:
> ** 1) q=something&rows=1000&fl=id,score
> ** 2) Request the full documents for the ids in the global top-1000, found among the top-1000 from each shard
> What the subject means:
> * "limited but high rows" means 1000 in the example above
> * "many shards" means 200-1000 in our case
> * "all with many hits" means that each of the shards have a significant number of hits on the query
> The problem grows on all three factors above
> Doing such a query on our system takes between 5 min to 1 hour - depending on a lot of things. It ought to be much faster, so lets make it.
> Profiling show that the problem is that it takes lots of time to access the store to get id’s for (up to) 1000 docs (value of rows parameter) per shard. Having 1000 shards its up to 1 mio ids that has to be fetched. There is really no good reason to ever read information from store for more than the overall top-1000 documents, that has to be returned to the client.
> For further detail see mail-thread "Slow searching limited but high rows across many shards all with high hits" started 13/11-2014 on dev@lucene.apache.org


