
Posted to java-user@lucene.apache.org by Jochen Franke <Jo...@jCatalog.com> on 2005/02/25 21:42:30 UTC

Search performance with one index vs. many indexes

Topic: Search performance with large numbers of indexes vs. one large index

Hello,

we are experiencing a performance problem when using large
numbers of indexes.


We have an application with about

6 million documents
one index of about 7 GB
probably 10 to 15 million distinct words in that
index.

Creating the index from one DB (where the
documents come from) with two processors takes about 20 hours.

For several reasons (e.g. parallelizing the index creation), we
created several indexes, by splitting the documents into logical groups.


We first created an artificial benchmark:

10 million documents
500 indexes (about 3 files per index)
10 GB of index altogether
about 5,000 randomly selected words

Querying these indexes took about 0.4s per query, so it took only
twice as long as querying a single index, which was fine for us.

We did the same with one index merged out of the 500 indexes.

The lucene search performance was fine here as well (about 0.2s per 
query on our machine).



We then implemented the "real thing" which is:

6 million documents
800 indexes (with about 28 files per index)
about 7 GB index size
probably 10 to 15 million distinct words in that
index.

We now have a query performance of 4-8 seconds per query.

The test with the real data in one index has not been finished
so far.


My questions are:

- Is the size of the "wordlist" the problem?

- Would we be a lot faster if we had a smaller number
of files per index?

- Is 500-1000 still a reasonable number of indexes?

- Is there a more or less linear relationship between
the number of indexes and the execution time of the query
(as all indexes have to be checked and the results have
to be merged)?

- Are there any parameters that could be configured for
that use case?

- Should we implement any specialized classes specific to our use case?


Thanks,
Jochen Franke


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org

Indexing sit (stuff it) files

Posted by Luke Shannon <ls...@futurebrand.com>.

Hello;

I've almost completed my zip file indexer. I used the following to get an
InputStream for each file in the archive:

            ZipFile zip = new ZipFile(new File(fileLocation));
            ZipEntry zipEntry;
            Enumeration files = zip.entries();
            while (files.hasMoreElements()) {
                zipEntry = (ZipEntry) files.nextElement();
                // I have conditions here based on
                // zipEntry.getName().endsWith(".fileExtension") to determine
                // which Document Handler to use.
                // Below is the InputStream I send to the handler.
                InputStream in = zip.getInputStream(zipEntry);
            }

So far this is looking ok (not quite done yet).
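
For reference, a self-contained variant of the loop above might look like the sketch below. It assumes Java 7 or later (for try-with-resources); the names ZipEntryLister and entriesWithExtension are made up for illustration, and a real indexer would pass the stream to its document handler where the comment indicates.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Lucene or of the original indexer.
final class ZipEntryLister {
    /** Returns the names of all non-directory entries whose name ends with the given extension. */
    static List<String> entriesWithExtension(File archive, String extension) throws IOException {
        List<String> names = new ArrayList<>();
        // try-with-resources closes the archive even if a handler throws
        try (ZipFile zip = new ZipFile(archive)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                if (entry.isDirectory() || !entry.getName().endsWith(extension)) {
                    continue; // not a file this handler is responsible for
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    // A document handler would consume `in` here.
                    names.add(entry.getName());
                }
            }
        }
        return names;
    }
}
```

Closing the ZipFile and each entry stream explicitly matters when many archives are indexed in one run, since open file handles are otherwise released only at garbage collection.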

A request came in to index StuffIt files. I'm hoping to be able to do
something similar to the above, but I haven't found a Java API to work with
this file type. Does anyone have any experience with this?

Thanks,

Luke




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Fast access to a random page of the search results.

Posted by Stanislav Jordanov <st...@sirma.bg>.

Thank you guys,

there's a good chance that I will have the management persuaded to drop the
'random access requirement'.
As you surely know, management (usually) tends to be frantically
optimistic.
True to this trend, our management suggested to us (the R&D team) that:
"... it is time to assume that we will have to modify the core Lucene engine
in order to achieve what we want."
Nearly everyone in our craft likes to think of him- or herself as a fairly good
programming geek.
But honestly speaking, I lack the confidence that I can all of a sudden
introduce a significant improvement in Lucene -
a far-from-trivial framework that has been evolving for the last 4 (or 5)
years.

Highly appreciating the work you've done:
Stenly

----- Original Message ----- 
From: "Doug Cutting" <cu...@apache.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Wednesday, March 02, 2005 12:04 AM
Subject: Re: Fast access to a random page of the search results.


> Daniel Naber wrote:
> > After fixing this I can reproduce the problem with a local index that
> > contains about 220.000 documents (700MB). Fetching the first document
> > takes for example 30ms, fetching the last one takes >100ms. Of course I
> > tested this with a query that returns many results (about 50.000).
> > Actually it happens even with the default sorting, no need to sort by
> > some specific field.
>
> In part this is due to the fact that Hits first searches for the
> top-scoring 100 documents.  Then, if you ask for a hit after that, it
> must re-query.  In part this is also due to the fact that maintaining a
> queue of the top 50k hits is more expensive than maintaining a queue of
> the top 100 hits, so the second query is slower.  And in part this could
> be caused by other things, such as that the highest ranking document
> might tend to be cached and not require disk io.
>
> One could perform profiling to determine which is the largest factor.
> Of these, only the first is really fixable: if you know you'll need hit
> 50k then you could tell this to Hits and have it perform only a single
> query.  But the algorithmic cost of keeping the queue of the top 50k is
> the same as collecting all the hits and sorting them.  So, in part,
> getting hits 49,990 through 50,000 is inherently slower than getting
> hits 0-10.  We can minimize that, but not eliminate it.
>
> Doug
>



Re: Fast access to a random page of the search results.

Posted by Doug Cutting <cu...@apache.org>.

Daniel Naber wrote:
> After fixing this I can reproduce the problem with a local index that 
> contains about 220.000 documents (700MB). Fetching the first document 
> takes for example 30ms, fetching the last one takes >100ms. Of course I 
> tested this with a query that returns many results (about 50.000). 
> Actually it happens even with the default sorting, no need to sort by some 
> specific field.

In part this is due to the fact that Hits first searches for the 
top-scoring 100 documents.  Then, if you ask for a hit after that, it 
must re-query.  In part this is also due to the fact that maintaining a 
queue of the top 50k hits is more expensive than maintaining a queue of 
the top 100 hits, so the second query is slower.  And in part this could 
be caused by other things, such as that the highest ranking document 
might tend to be cached and not require disk io.

One could perform profiling to determine which is the largest factor. 
Of these, only the first is really fixable: if you know you'll need hit 
50k then you could tell this to Hits and have it perform only a single 
query.  But the algorithmic cost of keeping the queue of the top 50k is 
the same as collecting all the hits and sorting them.  So, in part, 
getting hits 49,990 through 50,000 is inherently slower than getting 
hits 0-10.  We can minimize that, but not eliminate it.
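
The cost of keeping a queue of the top N hits can be illustrated with a small sketch. This is not Lucene's actual Hits implementation; the class TopNCollector and its method are hypothetical and use only java.util.PriorityQueue. Every candidate that makes it into the queue costs O(log N), so a queue of 50k entries does more work per hit (and holds more memory) than a queue of 100.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of a bounded "top N" queue; not Lucene code.
final class TopNCollector {
    /** Returns the n highest scores in descending order, using a min-heap bounded to n entries. */
    static List<Float> topN(float[] scores, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the smallest retained score sits on top, ready to be evicted.
        PriorityQueue<Float> heap = new PriorityQueue<>(n);
        for (float score : scores) {
            if (heap.size() < n) {
                heap.add(score);          // O(log n) insert while the queue fills up
            } else if (score > heap.peek()) {
                heap.poll();              // evict the current minimum
                heap.add(score);          // O(log n) insert of the better hit
            }
        }
        List<Float> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }
}
```

With n close to the total number of hits, nearly every hit is inserted, which is why the work approaches that of collecting everything and sorting it.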

Doug


Re: Fast access to a random page of the search results.

Posted by Daniel Naber <da...@t-online.de>.

On Tuesday 01 March 2005 19:15, Doug Cutting wrote:

> 'nHits - nHits' always equals zero.  So you're actually printing the
> first document, not the last.  The last document would be accessed with
> 'hits.doc(nHits - 1)'.

After fixing this I can reproduce the problem with a local index that 
contains about 220,000 documents (700MB). Fetching the first document 
takes, for example, 30ms; fetching the last one takes >100ms. Of course I 
tested this with a query that returns many results (about 50,000). 
Actually it happens even with the default sorting; there is no need to sort 
by some specific field.

Regards
 Daniel

-- 
http://www.danielnaber.de


Re: Fast access to a random page of the search results.

Posted by Stanislav Jordanov <st...@sirma.bg>.

You're right:  nHits - nHits == 0 :)
But I did run the right tests - it just happened that I sent you the wrong
source.
I mean I performed the tests accessing the proper last doc: doc(nHits - 1),
then I switched to accessing the first hit, just to make sure (once again)
there is an essential difference in access times.
And instead of wiping out the code fragment (nHits - 1) and replacing it
with 0, I replaced 1 with nHits.
That's how the resulting (nHits - nHits) code got posted.

Yes, the index is stored on a local hard drive.

Stenly



----- Original Message ----- 
From: "Doug Cutting" <cu...@apache.org>
To: "Lucene Users List" <lu...@jakarta.apache.org>
Sent: Tuesday, March 01, 2005 8:15 PM
Subject: Re: Fast access to a random page of the search results.


> Stanislav Jordanov wrote:
> >                 startTs = System.currentTimeMillis();
> >                 dummyMethod(hits.doc(nHits - nHits));
> >                 stopTs = System.currentTimeMillis();
> >                 System.out.println("Last doc accessed in " + (stopTs -
> > startTs)
> >                                     + "ms");
>
> 'nHits - nHits' always equals zero.  So you're actually printing the
> first document, not the last.  The last document would be accessed with
> 'hits.doc(nHits - 1)'.  Accessing the last document should not be much
> slower (or faster) than accessing the first.
>
> 200+ milliseconds to access a document does seem slow.  Where is your
> index stored?  On a local hard drive?
>
> Doug
>



Re: Fast access to a random page of the search results.

Posted by Doug Cutting <cu...@apache.org>.

Stanislav Jordanov wrote:
>                 startTs = System.currentTimeMillis();
>                 dummyMethod(hits.doc(nHits - nHits));
>                 stopTs = System.currentTimeMillis();
>                 System.out.println("Last doc accessed in " + (stopTs -
> startTs)
>                                     + "ms");

'nHits - nHits' always equals zero.  So you're actually printing the 
first document, not the last.  The last document would be accessed with 
'hits.doc(nHits - 1)'.  Accessing the last document should not be much 
slower (or faster) than accessing the first.

200+ milliseconds to access a document does seem slow.  Where is your 
index stored?  On a local hard drive?

Doug


Re: Fast access to a random page of the search results.

Posted by Stanislav Jordanov <st...@sirma.bg>.

// The test source code (second attempt).
// Just in case the .txt attachment does not pass through
// I am pasting the code here:

package index_test;

import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.*;
import java.util.Enumeration;
import java.util.StringTokenizer;
import java.util.ArrayList;

import com.odi.util.query.QueryParseException;

public class Search {
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            throw new Exception("Usage: " + Search.class.getName()
                    + " <index dir>");
        }

        File indexDir = new File(args[0]);

        if (!indexDir.exists() || !indexDir.isDirectory()) {
            throw new Exception(indexDir
                    + " does not exist or is not a directory.");
        }

        System.out.println("Using index: " + indexDir.getCanonicalPath());

        BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE);
        search(indexDir);
    }

    public static void search(File indexDir)  throws Exception {
        Directory fsDir = FSDirectory.getDirectory(indexDir, false);
        IndexSearcher is = null;
        BufferedReader brdr = new BufferedReader(new
InputStreamReader(System.in));

        String q;
        Sort sort = null;
        while (!(q = brdr.readLine()).equals("exit")) {
            q = q.trim();
            if (is == null || q.equals("newsearcher")) {
                is = new IndexSearcher(fsDir);
                if (q.equals("newsearcher")) {
                    continue;
                }
            }
            if (q.startsWith("sort ")) {
                StringTokenizer tkz = new StringTokenizer(q);
                tkz.nextToken(); // skip the "sort" word
                ArrayList<SortField> sortFields = new
ArrayList<SortField>();
                while (tkz.hasMoreTokens()) {
                    String tok = tkz.nextToken();
                    boolean reverse = false;
                    if (tok.startsWith("-")) {
                        tok = tok.substring(1);
                        reverse = true;
                    }
                    sortFields.add(new SortField(tok, reverse));
                }
                sort = new Sort(sortFields.toArray(new SortField[0]));
                System.out.println("Sorting by " + sort);
                continue;
            }
            if (q.equals("nosort")) {
                sort = null;
                System.out.println("Sorting is off");
                continue;
            }
            long startTs = System.currentTimeMillis();
            Query query = null;
            try {
                query  = QueryParser.parse(q, "qcontent", new
StandardAnalyzer(new String[0]));
            }
            catch (QueryParseException exn) {
                exn.printStackTrace();
                continue;
            }
            Hits hits = (sort != null ? is.search(query, sort) :
is.search(query));
            int nHits = hits.length();//hc.nHits;
            long stopTs  = System.currentTimeMillis();
            System.out.println("Found " + nHits
                    + " document(s) that matched query '" + q + "'");
            System.out.println("Sorting by " + sort);
            System.out.println("query executed in " + (stopTs - startTs) +
"ms");

            if (nHits > 0) {
                startTs = System.currentTimeMillis();
                dummyMethod(hits.doc(nHits - nHits));
                stopTs = System.currentTimeMillis();
                System.out.println("Last doc accessed in " + (stopTs -
startTs)
                                    + "ms");
            }
        }
    }

    public static double  dummyMethod(Document doc) {
        return doc.getBoost();
    }

    private static void  dumpDocument(Document doc) throws IOException {
        System.out.println("<DOCUMENT-DUMP>");
        for (Enumeration e = doc.fields(); e.hasMoreElements(); ) {
            Field f = (Field) e.nextElement();
            System.out.println(f.name() + " ::>> '" + f.stringValue() +
"'");
        }
        System.out.println("</DOCUMENT-DUMP>");
    }
}

Re: Fast access to a random page of the search results.

Posted by Kelvin Tan <ke...@relevanz.com>.

Hi Mark, partially, yes. But I suppose Document would have to be further subclassed so that the other, non-initialized fields can be obtained as well, or perhaps given an additional method to init the remaining fields from a partially initialized Doc?

Thanks for responding..
k

On Mon, 07 Mar 2005 21:00:52 +0000, markharw00d wrote:
> Did you mean this?
> http://marc.theaimsgroup.com/?l=lucene-user&m=108525376821114&w=2
>
>
> Kelvin Tan wrote:
>
>> This is a bump post...
>>
>> I'm wondering if there's any code (contributed, bugzilla, core or
>> otherwise) that provides document lazy-loading functionality,
>> i.e. only eager-initialize specific fields, or load fields on-
>> demand.
>>
>> Thanks,
>> k
>>
>> On Thu, 3 Mar 2005 13:55:00 +0100, Kelvin Tan wrote:
>>
>>
>>> Is this actually in the codebase?  I couldn't find it in SVN or
>>> in Bugzilla...
>>>
>>> kelvin
>>>
>>> On Mon, 28 Feb 2005 11:59:54 -0500, Erik Hatcher wrote:
>>>
>>>
>>>> Or perhaps you
>>>> need to investigate the (is it in the codebase already?)
>>>> patch to load fields lazily upon demand instead.
>>>>
>>>>




Re: Fast access to a random page of the search results.

Posted by markharw00d <ma...@yahoo.co.uk>.

Did you mean this?
http://marc.theaimsgroup.com/?l=lucene-user&m=108525376821114&w=2




Kelvin Tan wrote:

>This is a bump post...
>
>I'm wondering if there's any code (contributed, bugzilla, core or otherwise) that provides document lazy-loading functionality, i.e. only eager-initialize specific fields, or load fields on-demand.
>
>Thanks,
>k
>
>On Thu, 3 Mar 2005 13:55:00 +0100, Kelvin Tan wrote:
>  
>
>> Is this actually in the codebase?  I couldn't find it in SVN or in
>> Bugzilla...
>>
>> kelvin
>>
>> On Mon, 28 Feb 2005 11:59:54 -0500, Erik Hatcher wrote:
>>    
>>
>>> Or perhaps you
>>> need to investigate the (is it in the codebase already?) patch to
>>>  load fields lazily upon demand instead.
>>>      
>>>




Re: Fast access to a random page of the search results.

Posted by Kelvin Tan <ke...@relevanz.com>.

This is a bump post...

I'm wondering if there's any code (contributed, bugzilla, core or otherwise) that provides document lazy-loading functionality, i.e. only eager-initialize specific fields, or load fields on-demand.

Thanks,
k

On Thu, 3 Mar 2005 13:55:00 +0100, Kelvin Tan wrote:
> Is this actually in the codebase?  I couldn't find it in SVN or in
> Bugzilla...
>
> kelvin
>
> On Mon, 28 Feb 2005 11:59:54 -0500, Erik Hatcher wrote:
>> Or perhaps you
>> need to investigate the (is it in the codebase already?) patch to
>>  load fields lazily upon demand instead.
>
>




Re: Fast access to a random page of the search results.

Posted by Kelvin Tan <ke...@relevanz.com>.

Is this actually in the codebase?  I couldn't find it in SVN or in Bugzilla...

kelvin

On Mon, 28 Feb 2005 11:59:54 -0500, Erik Hatcher wrote:
> Or perhaps you
> need to investigate the (is it in the codebase already?) patch to
> load fields lazily upon demand instead.




Re: Fast access to a random page of the search results.

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 28, 2005, at 10:39 AM, Stanislav Jordanov wrote:
> > What did you do in your private investigation?
> 1. empirical tests with an index of nearly 75,000 docs (I am attaching 
> the test source)

Only certain (.txt?) attachments are allowed to come through on the 
mailing list.

> > Sorted by descending relevance (the default), or in some other way?
> In some other way - sorted by some column (asc or desc - doesn't 
> matter)

Using IndexSearcher.search(query, sort)?

> > If a search is fast enough, as you report, then you can simply start
> > your access to Hits at the appropriate spot.  For the current systems
> > I'm working on, this is the approach I've used - start iterating hits
> > at (pageNumber - 1) * numberOfItemsPerPage.
> >
> > Is that approach insufficient?
> I'm afraid this is not sufficient;
> Either I am doing something wrong,
> or it is not that simple:
> following is a log from my test session;
> It appears that IndexSearcher.search(...) finishes rather fast
> compared to the time it takes to fetch the last document from the Hits 
> object.

I assume you are only accessing the documents you wish to display 
rather than all of them up to where you need.   Also keep in mind that 
accessing a Document is when the document is pulled from the index.  If 
you have a large amount of data in a document it will take a 
corresponding amount of time to load it.  You may need to restructure 
what you store in a document to reduce the load times.  Or perhaps you 
need to investigate the (is it in the codebase already?) patch to load 
fields lazily upon demand instead.
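
The idea of loading fields lazily can be sketched independently of Lucene. The class below is purely hypothetical (it is not the patch mentioned above, and LazyFieldDocument is an invented name): a field value is read through a caller-supplied loader only the first time it is requested, then cached.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of on-demand field loading; not a Lucene class.
final class LazyFieldDocument {
    private final Function<String, String> loader;          // assumed to read one stored field from the index
    private final Map<String, String> cache = new HashMap<>();
    private int loads = 0;                                   // number of times the loader was actually invoked

    LazyFieldDocument(Function<String, String> loader) {
        this.loader = loader;
    }

    /** Returns the field value, invoking the loader only on first access. */
    String get(String fieldName) {
        if (!cache.containsKey(fieldName)) {
            cache.put(fieldName, loader.apply(fieldName));
            loads++;
        }
        return cache.get(fieldName);
    }

    int loadCount() {
        return loads;
    }
}
```

A result page that displays only a title would then pay for reading the title field, not for a large body field stored in the same document.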

	Erik

>

> The log starts here:
>
> pa
>
> Found 74222 document(s) that matched query 'pa'
>
> Sorting by "sfile_name"
>
> query executed in 16ms
>
> Last doc accessed in 375ms
>
> us
>
> Found 74222 document(s) that matched query 'us'
>
> Sorting by "sfile_name"
>
> query executed in 31ms
>
> Last doc accessed in 219ms
>
> 1
>
> Found 74222 document(s) that matched query '1'
>
> Sorting by "sfile_name"
>
> query executed in 15ms
>
> Last doc accessed in 235ms
>
> 5
>
> Found 74222 document(s) that matched query '5'
>
> Sorting by "sfile_name"
>
> query executed in 422ms
>
> Last doc accessed in 219ms
>
> 6
>
> Found 72759 document(s) that matched query '6'
>
> Sorting by "sfile_name"
>
> query executed in 344ms
>
> Last doc accessed in 250ms


Re: Fast access to a random page of the search results.

Posted by Stanislav Jordanov <st...@sirma.bg>.

> What did you do in your private investigation?
1. empirical tests with an index of nearly 75,000 docs (I am attaching the test source)
2. reviewing and tracing the source code of Lucene
(I do not claim I have gained a deep understanding of it ;-)

> Sorted by descending relevance (the default), or in some other way?
In some other way - sorted by some column (asc or desc - doesn't matter)

> If a search is fast enough, as you report, then you can simply start 
> your access to Hits at the appropriate spot.  For the current systems 
> I'm working on, this is the approach I've used - start iterating hits 
> at (pageNumber - 1) * numberOfItemsPerPage.
> 
> Is that approach insufficient?

I'm afraid this is not sufficient;
Either I am doing something wrong,
or it is not that simple:
following is a log from my test session;
It appears that IndexSearcher.search(...) finishes rather fast
compared to the time it takes to fetch the last document from the Hits object.
The log starts here:
pa

Found 74222 document(s) that matched query 'pa'

Sorting by "sfile_name"

query executed in 16ms

Last doc accessed in 375ms

us

Found 74222 document(s) that matched query 'us'

Sorting by "sfile_name"

query executed in 31ms

Last doc accessed in 219ms

1

Found 74222 document(s) that matched query '1'

Sorting by "sfile_name"

query executed in 15ms

Last doc accessed in 235ms

5

Found 74222 document(s) that matched query '5'

Sorting by "sfile_name"

query executed in 422ms

Last doc accessed in 219ms

6

Found 72759 document(s) that matched query '6'

Sorting by "sfile_name"

query executed in 344ms

Last doc accessed in 250ms

Re: Fast access to a random page of the search results.

Posted by Erik Hatcher <er...@ehatchersolutions.com>.

On Feb 28, 2005, at 6:00 AM, Stanislav Jordanov wrote:
> my private investigation already left me sceptical about the outcome of
> this issue, but I've decided to post it as a final resort.

What did you do in your private investigation?

> Suppose I have an index of about 5,000,000 docs
> and I am running single-term queries against it, including queries
> which return say 1,000,000 or even more hits.
>
> The hits are sorted by some column and I am happy with the query
> execution time (i.e. the time spent in the IndexSearcher.search(...) method).
> Now comes the problem: it is a product requirement that the client is
> allowed to quickly access (by scrolling) a random page of the result set.
> Put differently, the app must quickly (in less than a second) respond
> to requests like: "Give me the results from No 567100 to No 567200"
> (remember the results are sorted, thus ordered).

Sorted by descending relevance (the default), or in some other way?

If a search is fast enough, as you report, then you can simply start 
your access to Hits at the appropriate spot.  For the current systems 
I'm working on, this is the approach I've used - start iterating hits 
at (pageNumber - 1) * numberOfItemsPerPage.

Is that approach insufficient?
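
A minimal sketch of that offset calculation, assuming 1-based page numbers and clamping the window to the number of hits. PageWindow is a hypothetical helper, not a Lucene class; with a Hits object one would loop from start (inclusive) to end (exclusive) and call hits.doc(i) only for those positions.

```java
// Hypothetical helper for computing which hits belong to a result page.
final class PageWindow {
    final int start; // index of the first hit on the page (inclusive)
    final int end;   // index one past the last hit on the page (exclusive)

    private PageWindow(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /** Computes the hit range for a 1-based page number, clamped to totalHits. */
    static PageWindow of(int pageNumber, int itemsPerPage, int totalHits) {
        if (pageNumber < 1 || itemsPerPage < 1 || totalHits < 0) {
            throw new IllegalArgumentException("invalid paging arguments");
        }
        // long arithmetic avoids int overflow for very large page numbers
        long first = (long) (pageNumber - 1) * itemsPerPage;
        int start = (int) Math.min(first, totalHits);
        int end = (int) Math.min(first + itemsPerPage, totalHits);
        return new PageWindow(start, end);
    }
}
```

A page past the end of the results yields an empty window (start equal to end) rather than an out-of-range index.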

	Erik


> I took a look at Lucene's internals, which only left me with the
> suspicion that this is an impossible task.
> Would anyone, please, prove my suspicion wrong?
>
> Regards
> Stanislav
>
>
>



Re: Fast access to a random page of the search results.

Posted by Volodymyr Bychkoviak <vb...@i-hypergrid.com>.

Just retrieve Documents 567100 to 567200 from the Hits object you got 
while searching.

Stanislav Jordanov wrote:

>Guys,
>my private investigation already left me sceptical about the outcome of this
>issue,
>but I've decided to post it as a final resort.
>Perhaps the gurus know the right answer :-)
>
>Suppose I have an index of about 5,000,000 docs
>and I am running single-term queries against it, including queries which
>return say 1,000,000 or even more hits.
>
>The hits are sorted by some column and I am happy with the query execution
>time (i.e. the time spent in the IndexSearcher.search(...) method).
>Now comes the problem: it is a product requirement that the client is
>allowed to quickly access (by scrolling) a random page of the result set.
>Put differently, the app must quickly (in less than a second) respond
>to requests like: "Give me the results from No 567100 to No 567200"
>(remember the results are sorted, thus ordered).
>I took a look at Lucene's internals, which only left me with the suspicion
>that this is an impossible task.
>Would anyone, please, prove my suspicion wrong?
>
>Regards
>Stanislav
>
>
>


Fast access to a random page of the search results.

Posted by Stanislav Jordanov <st...@sirma.bg>.

Guys,
my private investigation already left me sceptical about the outcome of this
issue,
but I've decided to post it as a final resort.
Perhaps the gurus know the right answer :-)

Suppose I have an index of about 5,000,000 docs
and I am running single-term queries against it, including queries which
return say 1,000,000 or even more hits.

The hits are sorted by some column and I am happy with the query execution
time (i.e. the time spent in the IndexSearcher.search(...) method).
Now comes the problem: it is a product requirement that the client is
allowed to quickly access (by scrolling) a random page of the result set.
Put differently, the app must quickly (in less than a second) respond
to requests like: "Give me the results from No 567100 to No 567200"
(remember the results are sorted, thus ordered).
I took a look at Lucene's internals, which only left me with the suspicion
that this is an impossible task.
Would anyone, please, prove my suspicion wrong?

Regards
Stanislav




Re: Search performance with one index vs. many indexes

Posted by Morus Walter <mo...@tanto.de>.

Jochen Franke writes:
> Topic: Search performance with large numbers of indexes vs. one large index
> 
> 
> My questions are:
> 
> - Is the size of the "wordlist" the problem?
> - Would we be a lot faster, when we have a smaller number
> of files per index?

Sure.
Look:
Index lookup of a word is O(ln(n)), where n is the number of words.
Index lookup of a word in k indexes having m words each is O( k ln(m) ).
In the best case all word lists are distinct (purely theoretical),
that is n = k*m or m = n/k.
For n = 15 million, k = 800:
ln(n) = 16.5
k*ln(n/k) = 7871
In a realistic case, m is much bigger since word lists won't be distinct.
But it's the linear factor k that bites you.
In the worst case (all words in all indices) you have
k*ln(n) = 13218.8
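
The three estimates above can be reproduced with a few lines of Java. This is only the arithmetic from this message (natural logarithm, n = 15 million words, k = 800 indexes); the class and method names are invented for illustration, and the numbers are relative cost estimates, not measured times.

```java
// Reproduces the lookup-cost estimates from the message above; names are hypothetical.
final class LookupCostEstimate {
    /** One index holding n distinct words: ln(n). */
    static double singleIndex(double n) {
        return Math.log(n);
    }

    /** Best case: k indexes with disjoint word lists of n / k words each: k * ln(n / k). */
    static double bestCase(double n, int k) {
        return k * Math.log(n / k);
    }

    /** Worst case: each of the k indexes contains all n words: k * ln(n). */
    static double worstCase(double n, int k) {
        return k * Math.log(n);
    }

    public static void main(String[] args) {
        double n = 15_000_000; // distinct words
        int k = 800;           // number of indexes
        System.out.println("single index: " + singleIndex(n));
        System.out.println("best case:    " + bestCase(n, k));
        System.out.println("worst case:   " + worstCase(n, k));
    }
}
```

Even in the best case the estimate is several hundred times that of a single index, because the factor k dominates the logarithm.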

HTH
	Morus
