Posted to java-user@lucene.apache.org by Winton Davies <wd...@overture.com> on 2001/11/15 00:05:24 UTC

Efficient doc information retrieval.

Hi all,
 
  Thanks for all your continuing help! I have got the go-ahead to
build a production-level prototype of my project. I have to be able
to serve several hundred queries a second (on big boxes), and I'm
currently getting 2 or 3 seconds/query with a sloppy phrase match. I
was trying to profile my user code, and I saw that it is the
uniqification loop that is killing me.

  In my application, I have to be able to return a list of documents
that have been uniqified according to an accountID. The most relevant
document for an accountID is returned, and then subsequent hits that
have the same accountID are dropped.

  So, in a recent search of an 8 million document index, I got around
200? sloppy phrase hits, and I needed to weed out the duplicates.

  so the pseudo code is:

   i = 0;
   while (i < hits.length && resultSet.size < 40) {
     accountID = doc(i).get("accountID");
     if (hashtable.get(accountID) == null) {
       insert accountID in hashtable, add doc(i) to resultSet;
     }
     i = i + 1;   // always advance, even when the hit is a duplicate
   }

  I timed it, and I was getting about 60 msecs each time round that 
loop, which makes me suspect the doc(i).get().
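
To confirm that suspicion, the loop can be instrumented so that the time spent fetching the accountID is counted separately from the rest of the loop body. The following is only a sketch under stated assumptions: the class name UniqifyTimer and the fetchAccountId parameter are hypothetical, and fetchAccountId stands in for the doc(i).get("accountID") call rather than calling Lucene itself.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.IntFunction;

// Hypothetical harness: splits the loop's wall-clock time into
// "fetching the accountID" and "everything else".
class UniqifyTimer {
    final List<Integer> kept = new ArrayList<>(); // hit positions that survive
    long fetchNanos = 0;                          // time inside the field lookup
    long totalNanos = 0;                          // time for the whole loop

    // fetchAccountId is an assumption standing in for doc(i).get("accountID").
    void run(int hitCount, int limit, IntFunction<String> fetchAccountId) {
        Set<String> seen = new HashSet<>();
        long loopStart = System.nanoTime();
        for (int i = 0; i < hitCount && kept.size() < limit; i++) {
            long t0 = System.nanoTime();
            String accountId = fetchAccountId.apply(i);
            fetchNanos += System.nanoTime() - t0;
            if (seen.add(accountId)) { // add() returns false for a duplicate
                kept.add(i);
            }
        }
        totalNanos = System.nanoTime() - loopStart;
    }

    public static void main(String[] args) {
        // Made-up data: one accountID per hit, in rank order.
        String[] accounts = {"acc1", "acc1", "acc1", "acc2", "acc2"};
        UniqifyTimer timer = new UniqifyTimer();
        timer.run(accounts.length, 40, i -> accounts[i]);
        System.out.println("kept hits: " + timer.kept);
        System.out.println("fetch ns: " + timer.fetchNanos + " of " + timer.totalNanos);
    }
}
```

If fetchNanos accounts for nearly all of totalNanos when the real field lookup is plugged in, the hashtable is not the problem and the per-document fetch is.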

  This seems really inefficient (the query is a sloppy phrase
match). Any ideas how I can speed this up? I'm obviously going to
try a RAMDirectory version, but the 60 msec delay seems
over the top.

I guess the short version of this is

  (a) Is there a way to do this uniqification somehow in the index itself?
  (b) Or is there a special kind of field which is ultrafast to access given "i"?
  (c) Or is there any way to speed up the existing behaviour?
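
One application-level way to approximate (b), sketched here as an assumption and not as anything Lucene provides, is to read every document's accountID once at startup into an array indexed by document number, so that the per-hit lookup becomes a plain array access. The class name AccountIdCache and the slowLookup parameter are hypothetical; the sketch also assumes accountIDs are (or can be mapped to) ints and that document numbers do not change while the cache is in use. At 4 bytes per document, 8 million documents would need about 32 MB.

```java
import java.util.function.IntUnaryOperator;

// Hypothetical cache: accountID per document number, filled once.
class AccountIdCache {
    private final int[] accountIdByDoc;

    // slowLookup is an assumption standing in for the per-document
    // stored-field read; it runs once per document here, at startup,
    // instead of once per hit at query time.
    AccountIdCache(int maxDoc, IntUnaryOperator slowLookup) {
        accountIdByDoc = new int[maxDoc];
        for (int doc = 0; doc < maxDoc; doc++) {
            accountIdByDoc[doc] = slowLookup.applyAsInt(doc);
        }
    }

    // Query-time lookup: a single array read.
    int accountId(int doc) {
        return accountIdByDoc[doc];
    }

    // Rough memory cost of the array contents, in bytes.
    long approxBytes() {
        return 4L * accountIdByDoc.length;
    }

    public static void main(String[] args) {
        // Made-up mapping: docs 0..2 belong to account 1, docs 3..5 to account 2.
        AccountIdCache cache = new AccountIdCache(6, doc -> doc < 3 ? 1 : 2);
        System.out.println("doc 4 -> account " + cache.accountId(4));
        System.out.println("bytes: " + cache.approxBytes());
    }
}
```

The trade-off is startup time and memory in exchange for removing the per-hit fetch; the cache would have to be rebuilt whenever the index changes.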

  Cheers,
  Winton

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Efficient doc information retrieval.

Posted by Winton Davies <wd...@overture.com>.
Thanks anyway! I much appreciate you thinking about it.
 
Cheers,
   Winton



Re: Efficient doc information retrieval.

Posted by "W. Eliot Kimber" <el...@isogen.com>.
Winton Davies wrote:
> 
> Hi Eliot,
> 
>   Not really, all documents have an accountID, but I need to search
> all the documents first, and each document that is returned has an
> accountID, but I just want one document per accountID.

I see the problem. Can't think of any other way to solve it than to
post-process the returned docs as you described. 

Eliot Kimber
ISOGEN International
eliot@isogen.com



Re: Efficient doc information retrieval.

Posted by Winton Davies <wd...@overture.com>.
Hi Eliot,

  Not really. All documents have an accountID, but I need to search
all the documents first. Each document that is returned has an
accountID, but I just want one document per accountID.

so:

  doc1 acc1
  doc2 acc1
  doc3 acc1
  doc4 acc2
  doc5 acc2
  doc6 acc2

Let's say the query "X" returns hits in this order:

  doc1
  doc2
  doc3
  doc4
  doc5

what I want returned is:

  doc1  (best of acc1)
  doc4  (best of acc2)

Note that creating a separate index for each account is impractical
(30K+ accountIDs).
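
The example above can be written out as a small runnable sketch. It assumes the hits arrive already sorted best-first and keeps the first document seen for each account; the class name BestPerAccount and the {doc, account} pair layout are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BestPerAccount {
    // hits: {docName, accountId} pairs, assumed to be ordered best-first.
    static List<String> firstHitPerAccount(String[][] hits) {
        // LinkedHashMap keeps accounts in the order they were first seen,
        // so the result stays in rank order.
        Map<String, String> best = new LinkedHashMap<>();
        for (String[] hit : hits) {
            best.putIfAbsent(hit[1], hit[0]); // later hits for the account are dropped
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        String[][] hits = {
            {"doc1", "acc1"}, {"doc2", "acc1"}, {"doc3", "acc1"},
            {"doc4", "acc2"}, {"doc5", "acc2"},
        };
        System.out.println(firstHitPerAccount(hits)); // prints [doc1, doc4]
    }
}
```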

  Cheers,
   Winton



At 17:30 -0600 11/14/01, W. Eliot Kimber wrote:
>Winton Davies wrote:
>>
>> Hi all,
>
>>   In my application, I have to be able to return a list of documents,
>> that have been uniqified according to an accountID. The most relevant
>> document for an accountID is returned, and then subsequent hits that
>> have the same accountID are dropped.
>
>Do you mean that certain documents are associated with particular
>account IDs? If so, why not include the account ID as part of the query?
>Or have I missed something?
>
>Cheers,
>
>Eliot Kimber
>ISOGEN International
>eliot@isogen.com
>




Re: Efficient doc information retrieval.

Posted by "W. Eliot Kimber" <el...@isogen.com>.
Winton Davies wrote:
> 
> Hi all,

>   In my application, I have to be able to return a list of documents,
> that have been uniqified according to an accountID. The most relevant
> document for an accountID is returned, and then subsequent hits that
> have the same accountID are dropped.

Do you mean that certain documents are associated with particular
account IDs? If so, why not include the account ID as part of the query?
Or have I missed something?

Cheers,

Eliot Kimber
ISOGEN International
eliot@isogen.com
