Posted to dev@lucene.apache.org by Matt Chaput <ma...@sidefx.com> on 2007/03/20 00:13:44 UTC
Writing out the term count when merging
Hi all!
I'm reimplementing a very Lucene-like search library as a learning
experience and I've run into a snag. Before I go deep code diving, I
thought I'd ask here in case someone has the time to answer.
The term dictionary file includes the term count in a header. But when
I'm merging segments, I can't know the collected number of UNIQUE terms
in the merging segments before I've read them, so I can't write the
header before I start merging the segments.
The ways I can see to do this are (a) to scan the term lists of the
segments first and build the collected term list in memory before
merging, (b) to leave space in the file for the term count and go back
and overwrite it later, or (c) to do something much more clever that
Lucene does but that I haven't figured out yet.
(b) is undesirable for me, because I'd like the option of using
compressed streams in the backend, which must be written serially.
Anyway, if someone more familiar with the code could point me in the
right direction, I'd appreciate it very much.
Thanks!
Matt
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: Writing out the term count when merging
Posted by Matt Chaput <ma...@sidefx.com>.
robert engels wrote:
> A better solution: since you probably need an index file into the
> terms file anyway, you might not even need the term count. You should
> read the index file into memory in any case (one entry for every 16
> terms, etc.), at which point you will know the number of terms in the
> file.
I thought the same thing originally (it's not strictly necessary), but
then I was glad to see that Lucene had it, just because it freed me from
dealing with EOF, which for some reason seems difficult in Python (the
language I'm using for this project). But maybe I should just suck it up ;).
Matt
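For what it's worth, EOF handling when reading records without a stored count can be fairly compact in Python, because a binary `read()` returns an empty bytes object at end of file. The sketch below is only an illustration under assumed conventions: the record layout (a 2-byte big-endian length followed by UTF-8 bytes) and the name `iter_terms` are hypothetical and are not taken from Lucene or from the library discussed in this thread.

```python
import io
import struct


def iter_terms(f):
    """Yield terms from a binary stream of length-prefixed records.

    Assumed (hypothetical) layout: 2-byte big-endian length, then that
    many bytes of UTF-8 text. No term count is needed up front.
    """
    while True:
        header = f.read(2)
        if not header:
            return  # read() returned b"": clean end of file
        if len(header) < 2:
            raise EOFError("truncated length prefix")
        (length,) = struct.unpack(">H", header)
        data = f.read(length)
        if len(data) < length:
            raise EOFError("truncated term")
        yield data.decode("utf-8")


# Build a small in-memory "file" and read it back.
buf = io.BytesIO()
for term in ["apple", "banana"]:
    encoded = term.encode("utf-8")
    buf.write(struct.pack(">H", len(encoded)) + encoded)
buf.seek(0)
terms = list(iter_terms(buf))
```

The only distinction the reader has to make is between a clean end (nothing left at a record boundary) and a truncated record, which is treated as an error here.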
Re: Writing out the term count when merging
Posted by robert engels <re...@ix.netcom.com>.
A better solution: since you probably need an index file into the
terms file anyway, you might not even need the term count. You should
read the index file into memory in any case (one entry for every 16
terms, etc.), at which point you will know the number of terms in the
file.
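One way to read this suggestion, sketched in Python: keep a sparse in-memory index with one entry per 16 terms, and count the terms as they stream past while the index is built. Note that a sparse index by itself only bounds the term count to within one interval, so this sketch assumes the exact total is recorded alongside the index. The names `build_sparse_index` and `locate_block`, and the use of term ordinals rather than file offsets, are hypothetical simplifications.

```python
import bisect

INDEX_INTERVAL = 16  # one index entry per 16 terms, as suggested above


def build_sparse_index(sorted_terms, interval=INDEX_INTERVAL):
    """Return (index, count) for an iterable of sorted terms.

    index is a list of (term, ordinal) pairs for every `interval`-th
    term; count is the exact number of terms seen.
    """
    index = []
    count = 0
    for ordinal, term in enumerate(sorted_terms):
        if ordinal % interval == 0:
            index.append((term, ordinal))
        count = ordinal + 1
    return index, count


def locate_block(index, term):
    """Return the ordinal of the indexed term at or before `term`.

    A reader would start scanning the terms file from that position.
    Returns None if `term` sorts before every indexed term.
    """
    keys = [t for t, _ in index]
    i = bisect.bisect_right(keys, term) - 1
    return index[i][1] if i >= 0 else None


all_terms = ["t%03d" % i for i in range(40)]
index, count = build_sparse_index(all_terms)
```

Because the index is small and is only complete once the last term has been written, it can be written out in one go at the end, with the count in front of it, without seeking backwards in the terms file.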
Re: Writing out the term count when merging
Posted by robert engels <re...@ix.netcom.com>.
Write the term count at the end of the file, uncompressed.
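A minimal Python sketch of this idea, under stated assumptions: terms are merged and deduplicated in a single pass, pushed through a compressor as they are produced, and the unique-term count is appended as a fixed-size uncompressed trailer once the merge is done. The record layout (2-byte length prefix, 8-byte trailer), the choice of `zlib`, and the function names `write_merged` and `read_merged` are all hypothetical; this is not how Lucene itself lays out its files.

```python
import heapq
import io
import struct
import zlib

TRAILER = struct.Struct(">Q")  # 8-byte big-endian count, stored uncompressed


def write_merged(segments, raw):
    """Merge sorted term lists into `raw`, writing strictly serially.

    Returns the number of unique terms written.
    """
    compressor = zlib.compressobj()
    count = 0
    previous = None
    for term in heapq.merge(*segments):
        if term == previous:
            continue  # the same term occurs in more than one segment
        data = term.encode("utf-8")
        record = struct.pack(">H", len(data)) + data
        raw.write(compressor.compress(record))
        previous = term
        count += 1
    raw.write(compressor.flush())   # finish the compressed body
    raw.write(TRAILER.pack(count))  # count goes last, outside the compression
    return count


def read_merged(raw):
    """Read the trailer first, then decode exactly `count` terms."""
    raw.seek(-TRAILER.size, io.SEEK_END)
    (count,) = TRAILER.unpack(raw.read(TRAILER.size))
    raw.seek(0)
    body = zlib.decompress(raw.read()[:-TRAILER.size])
    terms, pos = [], 0
    for _ in range(count):
        (length,) = struct.unpack_from(">H", body, pos)
        pos += 2
        terms.append(body[pos:pos + length].decode("utf-8"))
        pos += length
    return terms


out = io.BytesIO()
written = write_merged([["apple", "banana"], ["banana", "cherry"]], out)
```

The writer never seeks, so it works with a serial compressed stream; only the reader needs one seek to the end of the file to pick up the count.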