Posted to solr-user@lucene.apache.org by Brian Hurt <bh...@gmail.com> on 2013/03/18 23:08:26 UTC

Help getting a document by unique ID

So here's the problem I'm trying to solve: in my use case, all my
documents have a unique id associated with them (a string), and I very
often need to get them by id.  Currently I'm doing a search on id, and
this takes long enough it's killing my performance.  Now, it looks
like there is a GET call in the REST interface which does exactly what
I need, but I'm using the solrj interface.

So my two questions are:

1. Is GET the right function I should be using?  Or should I be using
some other function, or storing copies of the documents somewhere
else entirely for fast id-based retrieval?

2. How do I call GET with solrj?  I've googled for how to do this, and
haven't come up with anything.

Thanks.

Brian

Re: Help getting a document by unique ID

Posted by Erik Hatcher <er...@gmail.com>.
Can you send us what you're trying?  It definitely should not be slow.  Do you have a lot of, or large, stored fields that you're trying to retrieve?

Unless you're doing the transaction log / near-real-time stuff, here's how I'd get a document by id:

     /select?q={!term f=id}<document id>

The reason the {!term} stuff is there is to avoid any query parsing escaping madness.  If you simply did q=id:<id> you'd have to deal with escaping and other query parser headaches.
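From SolrJ the same trick applies: put the {!term} prefix into the q parameter. A minimal sketch (the core URL in the comments is a placeholder, and the helper only assembles the query string, so the raw id needs no escaping):

```java
// Sketch: build a {!term} lookup for SolrJ. The helper assembles the
// q parameter so the raw id needs no query-parser escaping.
public class TermLookup {

    // e.g. termQuery("id", "weird:id+chars") -> "{!term f=id}weird:id+chars"
    static String termQuery(String field, String id) {
        return "{!term f=" + field + "}" + id;
    }

    public static void main(String[] args) {
        String q = termQuery("id", "doc:42");
        System.out.println(q);
        // With SolrJ (server URL is a placeholder):
        // HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
        // QueryResponse rsp = solr.query(new SolrQuery(q));
    }
}
```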

	Erik


On Mar 18, 2013, at 15:08 , Brian Hurt wrote:

> So here's the problem I'm trying to solve: in my use case, all my
> documents have a unique id associated with them (a string), and I very
> often need to get them by id.  Currently I'm doing a search on id, and
> this takes long enough it's killing my performance.  Now, it looks
> like there is a GET call in the REST interface which does exactly what
> I need, but I'm using the solrj interface.
> 
> So my two questions are:
> 
> 1. Is GET the right function I should be using?  Or should I be using
> some other function, or storing copies of the documents somewhere
> else entirely for fast id-based retrieval?
> 
> 2. How do I call GET with solrj?  I've googled for how to do this, and
> haven't come up with anything.
> 
> Thanks.
> 
> Brian


Re: Help getting a document by unique ID

Posted by Stefan Matheis <ma...@gmail.com>.
It's a bit off-topic, but worth mentioning since Brian said graph database -- Neo4J uses (or can use) Lucene, so depending on the use case it may be worth a look?



On Tuesday, March 19, 2013 at 10:11 PM, Shawn Heisey wrote:

> On 3/19/2013 2:31 PM, Brian Hurt wrote:
> > Which is the problem- you might think that 60ms unique key accesses
> > (what I'm seeing) is more than good enough- and for most use cases,
> > you'd be right. But it's not unusual for a single web-page hit to
> > generate many dozens, if not low hundreds, of calls to get document by
> > id. At which point, 60ms hits pile up fast.
> 
> 
> 
> I have to concur with Jack's assessment that 60ms may indicate a general 
> performance issue, possibly caused by not having enough memory in your 
> server.
> 
> I've got a distributed index with 77 million documents in it, seven 
> shards, total index size about 85GB. It's running 4.2.
> 
> I tried some uncached unique id queries on it. This search kicks off 
> seven shard searches against two servers, collates the results, then 
> returns them to the browser. The results came back with a QTime of 7-8 
> milliseconds. When I try a different uncached query against one of the 
> shard servers directly (14GB index size), the QTime value is zero.
> 
> I have this performance level because I have plenty of extra RAM, which 
> lets the OS cache the index files effectively. Each server has half the 
> index (over 40GB on disk) and 64GB of RAM. Of that 64GB, 6GB is 
> allocated to Solr. If we say the OS takes up 1GB (which it most likely 
> does not), that leaves 57GB of OS disk cache. Java's garbage collector 
> is highly tuned in my setup, because without it, I experience very long 
> GC pauses.
> 
> 
> Here's some additional info that may or may not be useful to you:
> 
> The BloomFilter postings format for Lucene is rumored to have amazing 
> performance improvements for searching unique keys.
> 
> An obstacle: Solr does not currently have an out-of-the-box way to 
> actually use it. A high-level solution has been proposed, but no code 
> has been written yet. The following issue describes the current state:
> 
> https://issues.apache.org/jira/browse/SOLR-3950
> 
> You could always write your own custom postings format instead of 
> waiting for someone (most likely me) to figure out how to go about 
> including it directly in Solr. If you do this, I hope you'll be able to 
> attach your code to the issue so everyone benefits.
> 
> Thanks,
> Shawn




Re: Help getting a document by unique ID

Posted by Shawn Heisey <so...@elyograg.org>.
On 3/19/2013 2:31 PM, Brian Hurt wrote:
> Which is the problem- you might think that 60ms unique key accesses
> (what I'm seeing) is more than good enough- and for most use cases,
> you'd be right.  But it's not unusual for a single web-page hit to
> generate many dozens, if not low hundreds, of calls to get document by
> id.  At which point, 60ms hits pile up fast.

I have to concur with Jack's assessment that 60ms may indicate a general 
performance issue, possibly caused by not having enough memory in your 
server.

I've got a distributed index with 77 million documents in it, seven 
shards, total index size about 85GB.  It's running 4.2.

I tried some uncached unique id queries on it.  This search kicks off 
seven shard searches against two servers, collates the results, then 
returns them to the browser.  The results came back with a QTime of 7-8 
milliseconds.  When I try a different uncached query against one of the 
shard servers directly (14GB index size), the QTime value is zero.

I have this performance level because I have plenty of extra RAM, which 
lets the OS cache the index files effectively.  Each server has half the 
index (over 40GB on disk) and 64GB of RAM.  Of that 64GB, 6GB is 
allocated to Solr.  If we say the OS takes up 1GB (which it most likely 
does not), that leaves 57GB of OS disk cache.  Java's garbage collector 
is highly tuned in my setup, because without it, I experience very long 
GC pauses.


Here's some additional info that may or may not be useful to you:

The BloomFilter postings format for Lucene is rumored to have amazing 
performance improvements for searching unique keys.

An obstacle: Solr does not currently have an out-of-the-box way to 
actually use it.  A high-level solution has been proposed, but no code 
has been written yet.  The following issue describes the current state:

https://issues.apache.org/jira/browse/SOLR-3950

You could always write your own custom postings format instead of 
waiting for someone (most likely me) to figure out how to go about 
including it directly in Solr.  If you do this, I hope you'll be able to 
attach your code to the issue so everyone benefits.

Thanks,
Shawn


Re: Help getting a document by unique ID

Posted by Jack Krupansky <ja...@basetechnology.com>.
60ms does seem excessive for the simplest possible access - lookup by the 
unique key field value. SOMETHING is clearly unacceptable at that level. Is 
this on decent hardware?

Try a query with &debugQuery=true and look at the "timing" section and see 
what component(s) are eating up the lion's share of that 60 ms. Is it the 
query component or something else like faceting or highlighting?
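That check is just one extra parameter on the same lookup. A sketch (field name and id are placeholders; the helper builds the raw parameter string, and the comments show the SolrJ route):

```java
// Sketch: the same id lookup with debugQuery=true appended, so the
// response includes a per-component "timing" section.
public class DebugLookup {

    static String debugParams(String field, String id) {
        return "q={!term f=" + field + "}" + id + "&debugQuery=true";
    }

    public static void main(String[] args) {
        System.out.println(debugParams("id", "100000042"));
        // In SolrJ the equivalent is roughly:
        // SolrQuery q = new SolrQuery("{!term f=id}100000042");
        // q.set("debugQuery", "true");
        // ...then inspect rsp.getDebugMap().get("timing")
    }
}
```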

Or, are you returning a lot of field values?

Or, are you using a lot of filters that are relatively unique (and hence 
frequently recomputed)?

Are you doing a lot of updating while querying (and hence invalidating 
caches)?

-- Jack Krupansky

-----Original Message----- 
From: Brian Hurt
Sent: Tuesday, March 19, 2013 4:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Help getting a document by unique ID

On Mon, Mar 18, 2013 at 7:08 PM, Jack Krupansky <ja...@basetechnology.com> 
wrote:
> Hmmm... if query by your unique key field is killing your performance, 
> maybe
> you have some larger problem to address.

This is almost certainly true.  I'm well outside the use cases
targeted by Solr/Lucene, and it's a testament to the quality of the
product that it works at all.  Among other things, I'm implementing a
graph database on top of Solr (it being easier to build a graph
database on top of Solr than it is to implement Solr on top of a graph
database).

Which is the problem- you might think that 60ms unique key accesses
(what I'm seeing) is more than good enough- and for most use cases,
you'd be right.  But it's not unusual for a single web-page hit to
generate many dozens, if not low hundreds, of calls to get document by
id.  At which point, 60ms hits pile up fast.

The current plan is to just cache the documents as files in the local
file system (or possibly other systems), and have the get document
calls go there instead, while complicated searches still go to Solr.
Fortunately, this isn't complicated.

> How bad is it? Are you using the
> string field type? How long are your ids?

My ids start at 100 million and go up like a kite from there- thus the
string representation.

>
> The only thing the real-time GET API gives you is more immediate access to
> recently added, uncommitted data. Accessing older, committed data will be 
> no
> faster. But if accessing that recent data is what you are after, real-time
> GET may do the trick.

OK, so this is good to know.  This answers question #1: GET isn't the
function I should be calling.  Thanks.

Brian 


Re: Help getting a document by unique ID

Posted by Chris Hostetter <ho...@fucit.org>.
: Which is the problem- you might think that 60ms unique key accesses
: (what I'm seeing) is more than good enough- and for most use cases,
: you'd be right.  But it's not unusual for a single web-page hit to
: generate many dozens, if not low hundreds, of calls to get document by
: id.  At which point, 60ms hits pile up fast.

1) if you currently have lazy field loading enabled in your 
solrconfig.xml, try turning it off and see if it speeds up your tests...
https://issues.apache.org/jira/browse/SOLR-4589

2) is your webapp doing these sequentially in a single thread? why not ask 
for all of the docs you need in a single query? (a few hundred query 
clauses isn't a big deal).  how is the list of IDs to fetch picked?  what 
do you do with the data in your webapp?
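As a sketch of that batching (field name, quoting approach, and the SolrJ wiring are assumptions; quoting each id as a phrase on a string field sidesteps most escaping headaches):

```java
import java.util.List;

// Sketch: fold a batch of ids into one boolean query instead of
// issuing one request per id. Each id is quoted as a phrase, so the
// only characters needing escape inside it are backslash and quote.
public class BatchedIdQuery {

    static String batchQuery(String field, List<String> ids) {
        StringBuilder sb = new StringBuilder(field).append(":(");
        for (int i = 0; i < ids.size(); i++) {
            if (i > 0) sb.append(" OR ");
            String escaped = ids.get(i)
                    .replace("\\", "\\\\")
                    .replace("\"", "\\\"");
            sb.append('"').append(escaped).append('"');
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(batchQuery("id", List.of("100000001", "100000002")));
        // -> id:("100000001" OR "100000002")
        // Hand the string to new SolrQuery(...) and set rows to the batch size.
    }
}
```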

https://people.apache.org/~hossman/#xyproblem
XY Problem

Your question appears to be an "XY Problem" ... that is: you are dealing
with "X", you are assuming "Y" will help you, and you are asking about "Y"
without giving more details about the "X" so that we can understand the
full issue.  Perhaps the best solution doesn't involve "Y" at all?
See Also: http://www.perlmonks.org/index.pl?node_id=542341


-Hoss

Re: Help getting a document by unique ID

Posted by Brian Hurt <bh...@gmail.com>.
On Mon, Mar 18, 2013 at 7:08 PM, Jack Krupansky <ja...@basetechnology.com> wrote:
> Hmmm... if query by your unique key field is killing your performance, maybe
> you have some larger problem to address.

This is almost certainly true.  I'm well outside the use cases
targeted by Solr/Lucene, and it's a testament to the quality of the
product that it works at all.  Among other things, I'm implementing a
graph database on top of Solr (it being easier to build a graph
database on top of Solr than it is to implement Solr on top of a graph
database).

Which is the problem- you might think that 60ms unique key accesses
(what I'm seeing) is more than good enough- and for most use cases,
you'd be right.  But it's not unusual for a single web-page hit to
generate many dozens, if not low hundreds, of calls to get document by
id.  At which point, 60ms hits pile up fast.

The current plan is to just cache the documents as files in the local
file system (or possibly other systems), and have the get document
calls go there instead, while complicated searches still go to Solr.
Fortunately, this isn't complicated.

> How bad is it? Are you using the
> string field type? How long are your ids?

My ids start at 100 million and go up like a kite from there- thus the
string representation.

>
> The only thing the real-time GET API gives you is more immediate access to
> recently added, uncommitted data. Accessing older, committed data will be no
> faster. But if accessing that recent data is what you are after, real-time
> GET may do the trick.

OK, so this is good to know.  This answers question #1: GET isn't the
function I should be calling.  Thanks.

Brian

Re: Help getting a document by unique ID

Posted by Jack Krupansky <ja...@basetechnology.com>.
Hmmm... if query by your unique key field is killing your performance, maybe 
you have some larger problem to address. How bad is it? Are you using the 
string field type? How long are your ids?

The only thing the real-time GET API gives you is more immediate access to 
recently added, uncommitted data. Accessing older, committed data will be no 
faster. But if accessing that recent data is what you are after, real-time 
GET may do the trick.

I don't recall seeing changes to add it to SolrJ.

Realtime Get:
http://searchhub.org/2011/09/07/realtime-get/
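For completeness, hitting the /get handler from SolrJ can be done by pointing a request at that handler path. A sketch (the core URL is a placeholder and the SolrJ lines are assumptions for your setup; the helper just builds the raw URL):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: a real-time get is just a request to the /get handler with
// an id parameter. The helper builds the raw URL; the comments show
// the SolrJ route via setRequestHandler.
public class RealtimeGet {

    static String getUrl(String coreUrl, String id) {
        try {
            return coreUrl + "/get?id=" + URLEncoder.encode(id, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        System.out.println(getUrl("http://localhost:8983/solr/collection1", "doc 42"));
        // In SolrJ:
        // SolrQuery q = new SolrQuery();
        // q.setRequestHandler("/get");
        // q.set("id", "doc 42");
        // QueryResponse rsp = solr.query(q);  // then rsp.getResponse().get("doc")
    }
}
```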

-- Jack Krupansky

-----Original Message----- 
From: Brian Hurt
Sent: Monday, March 18, 2013 6:08 PM
To: solr-user@lucene.apache.org
Subject: Help getting a document by unique ID

So here's the problem I'm trying to solve: in my use case, all my
documents have a unique id associated with them (a string), and I very
often need to get them by id.  Currently I'm doing a search on id, and
this takes long enough it's killing my performance.  Now, it looks
like there is a GET call in the REST interface which does exactly what
I need, but I'm using the solrj interface.

So my two questions are:

1. Is GET the right function I should be using?  Or should I be using
some other function, or storing copies of the documents somewhere
else entirely for fast id-based retrieval?

2. How do I call GET with solrj?  I've googled for how to do this, and
haven't come up with anything.

Thanks.

Brian