Posted to java-user@lucene.apache.org by Robert Bart <rb...@cs.washington.edu> on 2011/12/22 03:45:56 UTC

Retrieving large numbers of documents from several disks in parallel

Hi All,


I am running Lucene 3.4 in an application that indexes about 1 billion
factual assertions (Documents) from the web over four separate disks, so
that each disk has a separate index of about 250 million documents. The
Documents are relatively small, less than 1KB each. These indexes provide
data to our web demo (http://openie.cs.washington.edu), where a typical
search needs to retrieve and materialize as many as 3,000 Documents from
each index in order to display a page of results to the user.


In the worst case, a new, uncached query takes around 30 seconds to
complete, with all four disks IO bottlenecked during most of this time. My
implementation uses a separate Thread per disk to (1) call
IndexSearcher.search(Query query, Filter filter, int n) and (2) process the
Documents returned from IndexSearcher.doc(int). Since 30 seconds seems like
a long time to retrieve 3,000 small Documents, I am wondering if I am
overlooking something simple somewhere.
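
To be concrete, the per-disk retrieval is essentially this (a heavily
simplified sketch, not our actual code; "searchers", "query" and
"filter" are placeholder final fields, and it uses java.util.concurrent
plus the usual org.apache.lucene.search/document classes):

    // One IndexSearcher per disk, opened once at startup and reused.
    ExecutorService pool = Executors.newFixedThreadPool(searchers.length);
    List<Future<List<Document>>> futures = new ArrayList<Future<List<Document>>>();
    for (final IndexSearcher s : searchers) {
        futures.add(pool.submit(new Callable<List<Document>>() {
            public List<Document> call() throws Exception {
                TopDocs top = s.search(query, filter, 3000);
                List<Document> docs = new ArrayList<Document>(top.scoreDocs.length);
                for (ScoreDoc sd : top.scoreDocs) {
                    docs.add(s.doc(sd.doc));  // materializing here is the slow part
                }
                return docs;
            }
        }));
    }
    List<Document> all = new ArrayList<Document>();
    for (Future<List<Document>> f : futures) {
        all.addAll(f.get());                  // gather from all four indexes
    }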


Is there a better method for retrieving documents in bulk?


Is there a better way of searching indexes on separate disks in parallel
than using a MultiReader (which doesn’t seem to parallelize the task of
materializing Documents)?


Any other suggestions? I have tried some of the basic ideas on the Lucene
wiki, such as leaving the IndexSearcher open for the life of the process (a
servlet). Any help would be greatly appreciated!


Rob

Re: Retrieving large numbers of documents from several disks in parallel

Posted by Erick Erickson <er...@gmail.com>.
I'll take your word for it, though it seems odd. I'm wondering
if there's anything you can do to pre-process the documents
at index time to make the post-processing less painful, but
that's a wild shot in the dark...

Another possibility would be to fetch only the fields you need
to do the post-processing, then go back to the servers for the
full documents, something like q=id:(1 2 3 4 5 6 .....), assuming
that your post-processing reduces to a reasonable number
of documents. I have no real clue whether you could reduce
the data returned to few enough fields to really make a difference,
but it might be a possibility.
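
Something like this for the first pass, if you're talking to Lucene
directly rather than Solr (a rough, untested sketch; "id" is whatever
your unique key field is, and MapFieldSelector/FieldSelector live in
org.apache.lucene.document):

    // Load only the "id" field instead of materializing whole documents.
    FieldSelector idOnly = new MapFieldSelector(new String[] { "id" });
    TopDocs top = searcher.search(query, filter, 3000);
    List<String> ids = new ArrayList<String>(top.scoreDocs.length);
    for (ScoreDoc sd : top.scoreDocs) {
        ids.add(searcher.doc(sd.doc, idOnly).get("id"));
    }
    // After post-processing narrows that list down, go back for the few
    // full documents you actually need (e.g. a BooleanQuery of TermQuery
    // clauses on "id", or searcher.doc(docId) if you kept the doc ids).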

But I think you're between a rock and a hard place. If you can't
push your processing down to the search, and you really *do*
have to load all those docs, you will have painful queries.
Have you thought of an SSD?

Best
Erick



Re: Retrieving large numbers of documents from several disks in parallel

Posted by Robert Bart <rb...@cs.washington.edu>.
Erick,

Thanks for your reply! You are probably right to question how many
Documents we are retrieving. We know it isn't best, but significantly
reducing that number will require us to completely rebuild our system.
Before we do that, we were just wondering if there was anything in the
Lucene API or elsewhere that we were missing that might be able to help
out.

Search times for a single document from each index seem to run about 100ms.
For normal queries (e.g. 3K Documents) almost all of the query time is spent
loading documents (i.e. going from Fields to Strings). Our post-processing
takes < 1s, and is done only after all docs are loaded.

The problem with our current setup is that we need to retrieve a large
sample of Documents and aggregate them in certain ways before we can tell
which ones we really needed and which ones we didn't. This doesn't sound
like something a Filter would be able to do, so I'm not sure how well we
could push this down into the search process. Avoiding this aggregation
step is what would require us to completely redesign the system - but maybe
that's just what we'll have to do.

Anyway, thanks for your help! Any other suggestions would be appreciated,
but if there is no (relatively) easy solution, that's ok.

Rob



-- 
Rob

Re: Retrieving large numbers of documents from several disks in parallel

Posted by Erick Erickson <er...@gmail.com>.
I call into question why you "retrieve and materialize as
many as 3,000 Documents from each index in order to
display a page of results to the user". You have to be
doing some post-processing because displaying
12,000 documents to the user is completely useless.

I wonder if this is an "XY" problem, see:
http://people.apache.org/~hossman/#xyproblem.

You're seeking all over each disk for 3,000
documents, which will take time no matter what,
especially if you're loading a bunch of fields.

So let's back up a bit and ask why you think you need
all those documents. Is it something you could push
down into the search process?

Also, 250M docs/index is a lot of docs. Before continuing,
it would be useful to know your raw search performance if
you, say, fetched 1 document from each partition, keeping
in mind Lance's comments that the first searches load
up a bunch of caches and will be slow. And, as he says,
you can get around that with autowarming.
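
Even something crude like this would separate the two costs (just a
sketch, not meant to be rigorous; "query" and "filter" are your own):

    long t0 = System.nanoTime();
    TopDocs top = searcher.search(query, filter, 1);    // raw search
    long t1 = System.nanoTime();
    if (top.scoreDocs.length > 0) {
        searcher.doc(top.scoreDocs[0].doc);             // one doc load
    }
    long t2 = System.nanoTime();
    System.out.println("search " + (t1 - t0) / 1000000 + " ms, "
        + "doc load " + (t2 - t1) / 1000000 + " ms");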

But before going there, let's understand the root of the problem.
Is it search speed or just loading all those documents and
then doing your post-processing?


Best
Erick



Re: Retrieving large numbers of documents from several disks in parallel

Posted by Lance Norskog <go...@gmail.com>.
Is each index optimized?

From my vague grasp of Lucene file formats, I think you want to sort
the documents by segment document id, which is the order in which the
documents are laid out on disk. This lets you materialize documents in
their on-disk order.
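
Something like this before you start calling doc(), assuming you have
the TopDocs in hand (untested):

    ScoreDoc[] hits = top.scoreDocs.clone();
    Arrays.sort(hits, new Comparator<ScoreDoc>() {
        public int compare(ScoreDoc a, ScoreDoc b) {
            return a.doc - b.doc;       // ascending doc id ~ on-disk order
        }
    });
    for (ScoreDoc sd : hits) {
        Document d = searcher.doc(sd.doc);
        // ... pull out the stored fields ...
    }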

Solr (and other apps) generally use a separate thread per task and
separate index reading classes (not sure which any more).

As to the cold-start, how many terms are there? You are loading them
into the field cache, right? Solr has a feature called "auto-warming"
which automatically runs common queries each time it reopens an index.
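
In plain Lucene you can do the same thing by hand whenever you reopen
the index. A rough sketch, where "warmupQueries" is just a handful of
typical queries you keep around:

    IndexReader newReader = oldReader.reopen();
    if (newReader != oldReader) {
        IndexSearcher newSearcher = new IndexSearcher(newReader);
        for (Query q : warmupQueries) {
            newSearcher.search(q, 10);  // touch the caches before real traffic
        }
        // now swap newSearcher in and close the old reader/searcher
    }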




-- 
Lance Norskog
goksron@gmail.com



Re: Retrieving large numbers of documents from several disks in parallel

Posted by Paul Libbrecht <pa...@hoplahup.net>.
Michael,

From a physical point of view, the order in which the documents are read seems very significant for reading speed (the random-access seeks are probably the issue).

You could:
- move to a RAM disk or SSD to see whether it makes a difference
- use something other than a plain searcher that might do the bulk retrieval better (pure speculation: does a custom hit collector make a difference? see the sketch below)
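
Pure speculation again, but a collector that only records doc ids
would let you load the stored fields afterwards in whatever order you
prefer. A rough, untested sketch against the 3.x Collector API:

    final List<Integer> ids = new ArrayList<Integer>();
    searcher.search(query, new Collector() {
        private int docBase;
        public void setScorer(Scorer scorer) { }       // scores not needed
        public void setNextReader(IndexReader reader, int docBase) {
            this.docBase = docBase;
        }
        public void collect(int doc) {
            ids.add(docBase + doc);                     // global doc id
        }
        public boolean acceptsDocsOutOfOrder() { return true; }
    });
    Collections.sort(ids);          // then materialize in doc-id order
    for (int id : ids) {            // (cap this for very broad queries)
        Document d = searcher.doc(id);
    }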

hope it helps.

paul

