Posted to dev@lucene.apache.org by Matt Chaput <ma...@sidefx.com> on 2007/03/20 00:13:44 UTC

Writing out the term count when merging

Hi all!

I'm reimplementing a very Lucene-like search library as a learning 
experience and I've run into a snag. Before I go deep code diving, I 
thought I'd ask here in case someone has the time to answer.

The term dictionary file includes the term count in a header. But when 
I'm merging segments, I can't know the collected number of UNIQUE terms 
in the merging segments before I've read them, so I can't write the 
header before I start merging the segments.

The ways I can see to do this are (a) to scan the term lists of the 
segments first and build the collected term list in memory before 
merging, (b) leave space in the file for the term count and go back and 
overwrite it later, or (c) something much more clever that Lucene does 
but I haven't figured out yet.

(b) is undesirable for me, because I'd like the option of using 
compressed streams in the backend, which must be written serially.

Anyway, if someone more familiar with the code could point me in the 
right direction, I'd appreciate it very much.

Thanks!

Matt
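
A variant of option (a) can be sketched in Python without holding the merged term list in memory. It makes a counting pass over the sorted term lists first, so the count is known before the header is written, at the cost of reading every term list twice. The function name and the sample segments are hypothetical, and this is not a description of what Lucene itself does.

```python
import heapq
from itertools import groupby

def count_unique_terms(segments):
    """Count distinct terms across several sorted term iterables.

    heapq.merge yields the terms of all segments in sorted order while
    holding only one pending term per segment; groupby then collapses
    each run of equal terms into a single group.
    """
    merged = heapq.merge(*segments)
    return sum(1 for _term, _run in groupby(merged))

# Hypothetical segments; each term list must already be sorted.
seg_a = ["apple", "banana", "cherry"]
seg_b = ["banana", "date"]
seg_c = ["apple", "date", "elderberry"]

total = count_unique_terms([seg_a, seg_b, seg_c])
print(total)  # 5
```

In a real merge the segments would be iterators over the on-disk term files, reopened for the second pass that writes the merged output.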



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: Writing out the term count when merging

Posted by Matt Chaput <ma...@sidefx.com>.

robert engels wrote:
> but a better solution, since you probably need an index file into the 
> terms file, you might not even need the term count, since you should 
> read the index file into memory anyway (read every 16 entries, etc.) - 
> at which point you will know the number of terms in the file.

I thought the same thing originally-- it's not strictly necessary-- but 
then I was glad to see that Lucene had it just because it freed me from 
dealing with EOF, which for some reason seems difficult in Python (the 
language I'm using for this project). But maybe I should just suck it up ;).

Matt
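
For what it's worth, a minimal sketch of reading records until EOF in Python, assuming a made-up record layout (a 2-byte big-endian length followed by UTF-8 bytes) rather than Lucene's actual term format. It relies on the fact that read() on a buffered binary stream returns an empty bytes object once the stream is exhausted.

```python
import io
import struct

def write_term(stream, term):
    """Write one term as a 2-byte length prefix plus UTF-8 bytes."""
    data = term.encode("utf-8")
    stream.write(struct.pack(">H", len(data)))
    stream.write(data)

def read_terms(stream):
    """Yield terms until the stream is exhausted."""
    while True:
        header = stream.read(2)
        if not header:
            # read() returned b"": clean end of file, stop iterating.
            return
        if len(header) < 2:
            raise EOFError("truncated length prefix")
        (length,) = struct.unpack(">H", header)
        data = stream.read(length)
        if len(data) < length:
            raise EOFError("truncated term")
        yield data.decode("utf-8")

# Round trip through an in-memory buffer standing in for a file.
buf = io.BytesIO()
for t in ["apple", "banana", "cherry"]:
    write_term(buf, t)
buf.seek(0)

terms = list(read_terms(buf))
print(len(terms), terms)  # 3 ['apple', 'banana', 'cherry']
```

With a reader like this, the term count falls out of the scan, so a header count is a convenience rather than a requirement.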


Re: Writing out the term count when merging

Posted by robert engels <re...@ix.netcom.com>.

But a better solution: since you probably need an index file into the
terms file, you might not even need the term count. You should read the
index file into memory anyway (read every 16 entries, etc.), at which
point you will know the number of terms in the file.



Re: Writing out the term count when merging

Posted by robert engels <re...@ix.netcom.com>.

Write the term count at the end of the file, uncompressed.
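
A minimal Python sketch of that idea, under some loudly stated assumptions: the terms are newline-delimited UTF-8, the body is a single zlib stream, and the trailer is an 8-byte big-endian count. None of this is Lucene's file format, and every name below is hypothetical. The writer only ever appends, so it never needs to seek backwards.

```python
import io
import struct
import zlib

# Hypothetical trailer layout: one unsigned 8-byte big-endian integer.
TRAILER = struct.Struct(">Q")

def write_terms(out, terms):
    """Stream terms through zlib, then append the count uncompressed."""
    comp = zlib.compressobj()
    count = 0
    for term in terms:
        out.write(comp.compress(term.encode("utf-8") + b"\n"))
        count += 1
    out.write(comp.flush())          # finish the compressed body
    out.write(TRAILER.pack(count))   # trailer goes last, uncompressed
    return count

def read_terms(data):
    """Split the file bytes into compressed body and trailer."""
    body, trailer = data[:-TRAILER.size], data[-TRAILER.size:]
    (count,) = TRAILER.unpack(trailer)
    text = zlib.decompress(body).decode("utf-8")
    terms = text.split("\n")[:-1]    # drop the empty piece after the last "\n"
    return count, terms

out = io.BytesIO()
write_terms(out, ["apple", "banana", "cherry"])
count, terms = read_terms(out.getvalue())
print(count, terms)  # 3 ['apple', 'banana', 'cherry']
```

A reader working on a real file would find the count by seeking to TRAILER.size bytes before the end, which requires a seekable input even though the writer stays strictly serial.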
