Posted to solr-user@lucene.apache.org by Shivaji Dutta <sd...@hortonworks.com> on 2016/01/15 01:20:49 UTC

Solr Query Tuning

Team

Thanks for all the help before.

Current State

I am working with a customer that has about a billion documents on 20 shards. The documents are extremely small, about 100 characters each.
The insert rate is pretty good, but they are trying to fetch documents by id using a SolrJ SolrQuery.

The Solr query takes about one minute to return.

The query is very simple:
id:<documentid>
Note that the content of the document is just the document id.

Request for Information

A) I am looking for some information on how I could go about tuning the query.
B) An alternate approach that I am thinking of is to use the "/get" request handler.
Is this going to be faster than "/select"? (A sketch of both approaches follows below.)
C) I am looking at the debugQuery option, but I am unsure how to interpret its output. I saw a SlideShare presentation that talked about "http://explain.solr.pl/help", but it only supports older versions of Solr.
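
For concreteness, here is a minimal SolrJ sketch of the two lookup styles in question. The ZooKeeper address, collection name, and document id are placeholders, and it assumes a SolrJ 5.x client where SolrClient.getById() is backed by the /get handler:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.client.solrj.util.ClientUtils;
    import org.apache.solr.common.SolrDocument;

    import java.io.IOException;

    public class IdLookup {
        public static void main(String[] args) throws SolrServerException, IOException {
            String docId = args[0];
            try (CloudSolrClient client = new CloudSolrClient("zk1:2181,zk2:2181/solr")) {
                client.setDefaultCollection("mycollection");

                // A) Ordinary /select query by id. Escape the id so characters
                //    such as ':' or '-' are not parsed as query syntax.
                SolrQuery q = new SolrQuery("id:" + ClientUtils.escapeQueryChars(docId));
                q.setRows(1);
                QueryResponse rsp = client.query(q);
                System.out.println("/select found " + rsp.getResults().getNumFound());

                // B) Real-time get via the /get handler: fetches by id directly,
                //    bypassing the normal distributed search components.
                SolrDocument doc = client.getById(docId);
                System.out.println("/get returned " + doc);
            }
        }
    }

If /get turns out to be noticeably faster for the same id, that would point at the distributed search path rather than document retrieval as the expensive part.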

Any thoughts?

Thanks
Shivaji

Re: Solr Query Tuning

Posted by Jack Krupansky <ja...@gmail.com>.
Sounds intriguing. It would have to know for sure which query parser is
being used, which might be set in the server-side defaults.

Over in Cassandra NoSQL database land we have the concept of a "token-aware
load balancing policy" on the client side that does the necessary magic
(requiring parsing of the query) to send the request to exactly the node
(or replica) that owns that token/ID.

But if you're really just trying to "query by ID", there should be a nice
clean API so you don't have to build query syntax.

-- Jack Krupansky

On Thu, Jan 14, 2016 at 8:41 PM, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> [Quoted text of Doug Turnbull's message and the earlier replies trimmed; see the posts below.]

Re: Solr Query Tuning

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
I suppose that /get is the query-by-id API. I wonder if it's reasonable to
expect it to be smart in SolrCloud usage.

On Thursday, January 14, 2016, Doug Turnbull <
dturnbull@opensourceconnections.com> wrote:

> [Quoted text of the earlier messages trimmed; see the posts below.]

-- 
Doug Turnbull | Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>

Re: Solr Query Tuning

Posted by Doug Turnbull <dt...@opensourceconnections.com>.
Stupid thought/question. Is there a query-by-id API that understands
SolrCloud routing and can simply forward the query to the shard that would
hold said document? Barring that, can one use SolrJ's routing brains to see
which shard a given id would be routed to and only query that shard?

-Doug
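
If one wanted to try the second idea, a rough sketch with the SolrJ cloud APIs might look like the following. The ZooKeeper address, collection name, and docId are placeholders; treat it as an illustration of the idea rather than a tested recipe:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.cloud.DocCollection;
    import org.apache.solr.common.cloud.Replica;
    import org.apache.solr.common.cloud.Slice;

    // Ask the collection's router which slice (shard) the id hashes to,
    // then talk only to that shard.
    String docId = "some-document-id";  // placeholder
    CloudSolrClient client = new CloudSolrClient("zk1:2181/solr");
    client.connect();
    DocCollection coll =
        client.getZkStateReader().getClusterState().getCollection("mycollection");
    Slice slice = coll.getRouter().getTargetSlice(docId, null, null, null, coll);
    Replica leader = slice.getLeader();
    System.out.println("id " + docId + " -> " + slice.getName()
        + " (leader core: " + leader.getCoreUrl() + ")");

From there you could point an HttpSolrClient at the leader's core URL and add distrib=false so the lookup never fans out to the other 19 shards.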

On Thursday, January 14, 2016, Jack Krupansky <ja...@gmail.com>
wrote:

> [Quoted text of Jack Krupansky's and Shawn Heisey's messages trimmed; see the posts below.]


-- 
Doug Turnbull | Search Relevance Consultant | OpenSource Connections
<http://opensourceconnections.com>, LLC | 240.476.9983
Author: Relevant Search <http://manning.com/turnbull>

Re: Solr Query Tuning

Posted by Jack Krupansky <ja...@gmail.com>.
Add &debug=all to your query; the "timing" section of the debug output
shows which Solr search component is consuming the time.

You may also have to add &debug=track to get the shard-specific info.
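
In SolrJ terms, roughly (reusing the client, docId, and imports from the sketch in the original post, plus java.util.Map; I'm assuming the track output shows up under the "track" key of the debug map):

    SolrQuery q = new SolrQuery("id:" + ClientUtils.escapeQueryChars(docId));
    q.set("debug", "all");    // adds the "timing" section (per-component times)
    q.add("debug", "track");  // adds per-shard tracking info for distributed requests
    QueryResponse rsp = client.query(q);

    Map<String, Object> debug = rsp.getDebugMap();
    System.out.println(debug.get("timing"));  // which search component is slow
    System.out.println(debug.get("track"));   // which shard/phase is slow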

In theory, 19 of the shards should return nothing and the 20th will return
a single document.

Maybe one of the shard nodes is having trouble and takes way too long to do
essentially nothing.

Does the document ID have any special characters in it? If so, be sure to
escape them or put the ID in quotes, otherwise some piece of the ID may
match lots of documents, although even that should not be a big problem.
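
For example, with SolrJ's ClientUtils helper (the id value here is made up):

    // Either escape the query-syntax characters in the id ...
    String q1 = "id:" + ClientUtils.escapeQueryChars("doc:2016-01-14/42");
    // ... or quote the whole value (escaping any embedded quotes or backslashes).
    String q2 = "id:\"doc:2016-01-14/42\"";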

And make sure the ID field is string or numeric, not tokenized text.
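
In other words, something along these lines in the schema, with solr.StrField rather than a tokenized solr.TextField type (exact names depend on your schema):

    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>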


-- Jack Krupansky

On Thu, Jan 14, 2016 at 7:53 PM, Shawn Heisey <ap...@elyograg.org> wrote:

> [Quoted text of Shawn Heisey's message trimmed; see the next post below.]

Re: Solr Query Tuning

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/14/2016 5:20 PM, Shivaji Dutta wrote:
> [Quoted text of the original message trimmed; see the first post above.]

I have no idea whether /get would be faster.  You'd need to try it.

Can you provide the SolrJ code that you are using to do the query? 
Another useful item would be the entire entry from the Solr logfile for
this query.  There will probably be multiple log entries for one query,
usually the relevant log entry is the last one in the series.  I may
need the schema, but we'll decide that later.

Are all 20 shards on the same server, or have you got them spread out
across multiple machines?  What is the replicationFactor on the
collection?  If there are multiple machines, how many shards live on
each machine, and how many machines do you have total?  Do you happen to
know how large the Lucene index is for each of these shards?  How much
total memory does each server have, and how large is the Java heap?  Is
there software other than Solr running on the machine(s)?

I suspect that you don't have enough memory for the operating
system to effectively cache your index.  Good performance for a billion
documents is going to require a lot of memory and probably a lot of servers.
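
As a purely illustrative back-of-envelope: a billion documents at roughly 100
bytes each is on the order of 100 GB of raw data, or about 5 GB per shard
before any index overhead. For id lookups to avoid hitting disk on every
shard, a large fraction of each shard's Lucene index needs to fit in free RAM
outside the Java heap on whatever machines host those shards; if the heap
plus other software is eating most of the memory, every query pays for disk
reads on 20 shards.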

https://wiki.apache.org/solr/SolrPerformanceProblems

Thanks,
Shawn