Posted to java-user@lucene.apache.org by Winton Davies <wd...@overture.com> on 2001/11/29 06:09:18 UTC

Parallelising a query...

Hi,
 
   Let's say I want to retrieve all relevant listings for a query (just 
suppose)...

   I have 4 million documents... I could:
 
   Split these into 4 x 1-million-document indexes and then send the 
query to 4 Lucene processes ? At the end I would have to merge the 
results and sort them by relevance.
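
   A minimal sketch of that final merge step, in plain Java. The Hit 
class and mergeTopK method are hypothetical names for illustration, 
not Lucene's API. It also assumes the scores coming back from the 
separate indexes are directly comparable, which may not hold if each 
index computes its own term statistics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical types for illustration only; this is not Lucene's API.
class ShardMerge {

    /** One scored hit, tagged with the sub-index (shard) it came from. */
    static final class Hit {
        final int shard;
        final int doc;
        final float score;

        Hit(int shard, int doc, float score) {
            this.shard = shard;
            this.doc = doc;
            this.score = score;
        }
    }

    /** Combines the per-shard hit lists and keeps the k highest-scoring hits. */
    static List<Hit> mergeTopK(List<List<Hit>> perShard, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perShard) {
            all.addAll(hits);
        }
        // Highest score first. Assumes scores are comparable across shards.
        all.sort(Comparator.comparingDouble((Hit h) -> h.score).reversed());
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }

    public static void main(String[] args) {
        List<Hit> shard0 = Arrays.asList(new Hit(0, 7, 0.9f), new Hit(0, 3, 0.2f));
        List<Hit> shard1 = Arrays.asList(new Hit(1, 5, 0.5f));
        for (Hit h : mergeTopK(Arrays.asList(shard0, shard1), 2)) {
            System.out.println("shard " + h.shard + " doc " + h.doc + " score " + h.score);
        }
    }
}
```

   Since each process only needs to return its own best k hits, the 
merge itself handles at most 4 x k entries, so it should be cheap next 
to the search.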

   Question for Doug or any other Search Engine guru -- would this 
reduce the time to find these results by 75% ?
 
   I know it is probably a hard question to answer (e.g. all the 
matching documents might just be in one process...) but what I'm 
really getting at is that the average length of the inverted lists 
that have to be joined would be reduced by 75%, so the join should 
take only about 25% of the time...
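
   To make the arithmetic concrete, here is a rough sketch that counts 
the steps of a join over synthetic inverted lists. It assumes the join 
is a simple linear merge of two lists sorted by doc number; whether 
Lucene joins this way is an assumption, and PostingJoin, postings and 
intersect are made-up names.

```java
// Synthetic illustration only; the class and method names are hypothetical.
class PostingJoin {

    /** Doc numbers below maxDoc that are multiples of step, in increasing order. */
    static int[] postings(int maxDoc, int step) {
        int[] docs = new int[(maxDoc + step - 1) / step];
        for (int i = 0; i < docs.length; i++) {
            docs[i] = i * step;
        }
        return docs;
    }

    /** Linear merge join of two increasing lists; returns {matches, loop iterations}. */
    static long[] intersect(int[] a, int[] b) {
        int i = 0;
        int j = 0;
        long matches = 0;
        long steps = 0;
        while (i < a.length && j < b.length) {
            steps++;
            if (a[i] == b[j]) {
                matches++;
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return new long[] { matches, steps };
    }

    public static void main(String[] args) {
        // One term occurs in every 2nd doc, the other in every 3rd doc.
        long[] whole = intersect(postings(4_000_000, 2), postings(4_000_000, 3));
        long[] shard = intersect(postings(1_000_000, 2), postings(1_000_000, 3));
        System.out.println("whole index: matches=" + whole[0] + " steps=" + whole[1]);
        System.out.println("one shard:   matches=" + shard[0] + " steps=" + shard[1]);
    }
}
```

   With evenly spread terms the per-shard step count comes out at 
roughly a quarter of the whole-index count. That only turns into a 75% 
saving in elapsed time if the 4 processes really run in parallel 
(separate CPUs or machines) and the terms are not skewed towards one 
shard.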

  Any thoughts on this idiocy ? The reason I ask: let's say I can't 
fit a 4 million document RAMDirectory index into 1 GB of heap space, 
but I could if I split it up across several processes :) ?

   Cheers,
    Winton
 

 
 
 

Winton Davies
Lead Engineer, Overture (NSDQ: OVER)
1820 Gateway Drive, Suite 360
San Mateo, CA 94404
work: (650) 403-2259
cell: (650) 867-1598
http://www.overture.com/

--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>

GCJ and Lucene ?

Posted by Winton Davies <wd...@overture.com>.

Hi,

Another maybe quick question:

Has anyone tried using GCJ with Lucene ?

http://www.gnu.org/software/gcc/java/

As far as I can tell, this tries to compile Java directly to native 
code. I think it is restricted to the JDK 1.1 classes, which might be 
a gotcha (does Lucene use any JDK 1.2 classes ?)

  Winton


Re: Parallelising a query...

Posted by Winton Davies <wd...@overture.com>.

Hi again....

  Another dumb question :) (actually I'm too busy to look at the code :) )

   In the index, is the data structure behind termDocs (is that the 
right term ?) sorted by anything, or is it just in insertion order ? 
I could see how one might want to sort it so that the docs with the 
highest term frequency come first, but I can also see why it might 
not help.

  e.g.   Token1 -> doc1 (2) [occurrences] -> doc2 (6) -> doc3 (3)

  or is it like this ?

         Token1 -> doc2 (6) -> doc3 (3) -> doc1 (2) ?
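
  The two candidate layouts can be written down as a small sketch. 
Posting, byDoc and byFreqDesc are hypothetical names, not Lucene 
classes, and the sketch makes no claim about which order the index 
actually uses.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.stream.Collectors;

// Hypothetical illustration of the two orderings; not Lucene's data structure.
class PostingOrder {

    /** One entry in a term's list: a doc number and the term's occurrences in it. */
    static final class Posting {
        final int doc;
        final int freq;

        Posting(int doc, int freq) {
            this.doc = doc;
            this.freq = freq;
        }

        @Override
        public String toString() {
            return "doc" + doc + " (" + freq + ")";
        }
    }

    /** Copy of the list ordered by increasing doc number. */
    static Posting[] byDoc(Posting[] in) {
        Posting[] out = in.clone();
        Arrays.sort(out, Comparator.comparingInt((Posting p) -> p.doc));
        return out;
    }

    /** Copy of the list ordered by decreasing term frequency. */
    static Posting[] byFreqDesc(Posting[] in) {
        Posting[] out = in.clone();
        Arrays.sort(out, Comparator.comparingInt((Posting p) -> p.freq).reversed());
        return out;
    }

    static String show(Posting[] list) {
        return "Token1 -> " + Arrays.stream(list)
                .map(Posting::toString)
                .collect(Collectors.joining(" -> "));
    }

    public static void main(String[] args) {
        Posting[] list = { new Posting(2, 6), new Posting(1, 2), new Posting(3, 3) };
        System.out.println(show(byDoc(list)));
        System.out.println(show(byFreqDesc(list)));
    }
}
```

  The trade-off as I see it: doc order lets two terms' lists be joined 
in a single pass, while frequency order puts the likely best docs 
first but gives up that cheap join.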

 
  I have an idea for an optimization I want to make, but I'm not sure 
whether it warrants investigation.

  Winton

