Posted to dev@lucene.apache.org by Erik Hatcher <er...@ehatchersolutions.com> on 2004/06/12 18:54:46 UTC

Term vectors: .tvf format question

I'm digging deeper into the Lucene index format to develop some 
higher-level diagrams of its structure.  One thing that is curious to me is 
the term text being stored in the .tvf file.  Why not point to the term 
dictionary by position somehow and avoid duplicating this string, 
saving possibly substantial index size?  I'm assuming this is for 
performance reasons.

Note, the Lucene index file formats documentation needs to be updated - 
TermText is no longer just a String, it is a <PrefixLength,Suffix> 
similar to how terms in the .tis are stored.  I've updated 
fileformats.xml/.html - if I've gotten this wrong, let me know.
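
For readers following along, the <PrefixLength,Suffix> idea can be 
sketched in Java as below.  This is only an illustration, not Lucene's 
actual writer code: the class and method names are hypothetical, and the 
real files use Lucene's own byte encodings rather than Java objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of prefix compression over a sorted term list.
public class PrefixCompressionSketch {

    /** One encoded term: chars shared with the previous term, plus the remainder. */
    public static final class Entry {
        public final int prefixLength;
        public final String suffix;

        public Entry(int prefixLength, String suffix) {
            this.prefixLength = prefixLength;
            this.suffix = suffix;
        }
    }

    /** Encodes each term relative to the term that precedes it. */
    public static List<Entry> encode(List<String> sortedTerms) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String term : sortedTerms) {
            int limit = Math.min(prev.length(), term.length());
            int shared = 0;
            while (shared < limit && prev.charAt(shared) == term.charAt(shared)) {
                shared++;
            }
            out.add(new Entry(shared, term.substring(shared)));
            prev = term;
        }
        return out;
    }

    /** Rebuilds the full terms by reusing the prefix of the previous term. */
    public static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String term = prev.substring(0, e.prefixLength) + e.suffix;
            out.add(term);
            prev = term;
        }
        return out;
    }
}
```

With the sorted terms lucene, lucid, lucky, encode produces <0,"lucene">, 
<3,"id">, <3,"ky">, and decode restores the original list.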

Just out of curiosity - are there any other known inconsistencies with 
the file formats documentation?  I'd be happy to fix them up if there 
are any other out-of-sync issues.  I only spotted the one mentioned 
above because, on seeing that the term text is duplicated, I looked in 
the code to see how term vectors are written.

	Erik


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Re: Term vectors: .tvf format question

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Jun 12, 2004, at 11:46 PM, Doug Cutting wrote:
>> Just out of curiosity - are there any other known inconsistencies 
>> with the file formats documentation?
>
> Good question.  Let me think...
>
> The segments file has also changed format, and this is not yet 
> reflected in the file format documentation.

I've just updated this.  Again, let me know if anything is wrong and 
I'll correct it.

> We should probably also somewhere make clear what's changed.  We 
> promise to do so at the top of the file, but don't.  So perhaps 
> sections which have changed should get "since 1.4" or "changed in 1.4" 
> notices or somesuch.  This will make life much easier for ports.

I would like to, sometime in the future, formalize the file format 
structure somehow.  Perhaps an XML file that describes each file, its 
bits and bytes in the detail that is currently done with 
fileformats.html.  If we did this rigorously enough, it shouldn't be 
too big a leap to have some type of verification test to ensure the 
format created matches the specified structure.

I'm not sure if any code generation could be done from such a 
descriptor, but maybe that is an option also in order to keep things 
tightly in sync.

	Erik




Re: Term vectors: .tvf format question

Posted by Dmitry Serebrennikov <dm...@earthlink.net>.
Doug Cutting wrote:

> So term-number-based vectors would be small and fast to use if all 
> you're using is a single, optimized index, but very slow to use with 
> unoptimized indexes and multiple indexes.  That seems like a bad 
> situation, so, unless someone figures out another way, we're stuck 
> with the current approach.  Vectors are bigger and slower than 
> optimal, but they're consistently so. 

I'm very familiar with this particular issue :). One solution that has 
worked for my application was to treat terms from different segments / 
indexes as always being different, even if they actually did have the 
same text. Later on in results processing, when the number of terms 
under consideration has been greatly reduced, I was able to do the 
lookups and further consolidate those terms that turned out to be 
identical. Not sure if this is a good general solution, but it has 
worked reasonably well for me.
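
A rough Java sketch of this late consolidation follows.  Everything in 
it is an assumption made for illustration: the class names, the 
per-term weight, and the in-memory lists standing in for each segment's 
term dictionary.  It is not the code from my application.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: terms stay segment-local until the candidate set is small.
public class LateConsolidationSketch {

    /** A term known only by its segment and its number within that segment. */
    public static final class SegmentTerm {
        public final int segment;
        public final int termNumber;
        public final double weight; // illustrative score carried through processing

        public SegmentTerm(int segment, int termNumber, double weight) {
            this.segment = segment;
            this.termNumber = termNumber;
            this.weight = weight;
        }
    }

    /**
     * Looks up the text of the few surviving terms and merges the ones that
     * turn out to be identical across segments, summing their weights.
     */
    public static Map<String, Double> consolidate(
            List<SegmentTerm> survivors, List<List<String>> segmentDictionaries) {
        Map<String, Double> byText = new HashMap<>();
        for (SegmentTerm t : survivors) {
            // The dictionary lookup happens only here, once per surviving term.
            String text = segmentDictionaries.get(t.segment).get(t.termNumber);
            byText.merge(text, t.weight, Double::sum);
        }
        return byText;
    }
}
```

The point of the design is that the expensive lookups are paid only for 
the terms that survive earlier filtering, not for every vector term.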

Dmitry.





Re: Term vectors: .tvf format question

Posted by Doug Cutting <cu...@apache.org>.
Erik Hatcher wrote:
> I'm digging deeper into the Lucene index format to develop some 
> higher-level diagrams of its structure.  One thing that is curious to me is 
> the term text being stored in the .tvf file.  Why not point to the term 
> dictionary by position somehow and avoid duplicating this string, saving 
> possibly substantial index size?  I'm assuming this is for performance 
> reasons.

The prefix compression helps some, but you're right, each term in a 
vector requires several bytes when it could optimally be represented as 
perhaps just one or two bytes on average if we numbered terms.

The problem is maintaining the numbering as the index grows and changes.
Lucene indexes grow by merging segments.  With term numbers, each 
segment would have a separate term numbering system.  Terms would be 
renumbered as segments are merged.  This is not hard to implement.  When 
you merge the term dictionaries, keep an array per segment mapping its 
old term numbers to new term numbers in the merged index.  Then use 
these arrays to upgrade the vectors to the new numbering as they're 
copied into the new segment index.  So far so good.  It requires 4 bytes 
per document of RAM when merging.  That makes optimizing large indexes 
much more memory-intensive than it is currently, but not prohibitively so.
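
The renumbering step described above might look roughly like this in 
Java.  It is a sketch under simplifying assumptions, not Lucene code: 
term dictionaries are plain sorted lists held in memory, a vector is 
just an array of term numbers, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch of renumbering terms while merging segments.
public class TermRenumberSketch {

    /** Merged dictionary plus one old-number to new-number table per segment. */
    public static final class MergeResult {
        public final List<String> mergedTerms;
        public final int[][] oldToNew;

        public MergeResult(List<String> mergedTerms, int[][] oldToNew) {
            this.mergedTerms = mergedTerms;
            this.oldToNew = oldToNew;
        }
    }

    /** Merges sorted per-segment dictionaries and records each term's new number. */
    public static MergeResult mergeDictionaries(List<List<String>> segmentTerms) {
        TreeSet<String> all = new TreeSet<>();
        for (List<String> terms : segmentTerms) {
            all.addAll(terms);
        }
        List<String> merged = new ArrayList<>(all);

        int[][] oldToNew = new int[segmentTerms.size()][];
        for (int s = 0; s < segmentTerms.size(); s++) {
            List<String> terms = segmentTerms.get(s);
            oldToNew[s] = new int[terms.size()];
            for (int old = 0; old < terms.size(); old++) {
                // Position in the merged, sorted dictionary is the new term number.
                oldToNew[s][old] = Collections.binarySearch(merged, terms.get(old));
            }
        }
        return new MergeResult(merged, oldToNew);
    }

    /** Rewrites one vector's term numbers as it is copied into the new segment. */
    public static int[] remapVector(int[] vector, int[] oldToNew) {
        int[] out = new int[vector.length];
        for (int i = 0; i < vector.length; i++) {
            out[i] = oldToNew[vector[i]];
        }
        return out;
    }
}
```

Merging dictionaries [apple, cat] and [bat, cat, dog] gives 
[apple, bat, cat, dog], with tables {0, 2} and {1, 2, 3}.  Each table 
is an int array, so it costs 4 bytes per entry while the merge runs.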

But what happens when you have an unoptimized index and you want to 
compare vectors from two different segments?  There's no way to do this 
without looking up all of the terms in each segment's term dictionary. 
This requires a random disk access per vector term and would hence be 
prohibitively slow.  MultiSearcher would have the same problem.

So term-number-based vectors would be small and fast to use if all 
you're using is a single, optimized index, but very slow to use with 
unoptimized indexes and multiple indexes.  That seems like a bad 
situation, so, unless someone figures out another way, we're stuck with 
the current approach.  Vectors are bigger and slower than optimal, but 
they're consistently so.

> Note, the Lucene index file formats documentation needs to be updated - 
> TermText is no longer just a String, it is a <PrefixLength,Suffix> 
> similar to how terms in the .tis are stored.  I've updated 
> fileformats.xml/.html - if I've gotten this wrong, let me know.

Looks good to me.  Thanks for catching this!

> Just out of curiosity - are there any other known inconsistencies with 
> the file formats documentation?

Good question.  Let me think...

The segments file has also changed format, and this is not yet reflected 
in the file format documentation.

The skip data description is new.  The text is clumsy, but I think it is 
mostly accurate.  One mistake is that TIFormat is now -2, not -1. 
Other than that, it looks right to me.

We should probably also somewhere make clear what's changed.  We promise 
to do so at the top of the file, but don't.  So perhaps sections which 
have changed should get "since 1.4" or "changed in 1.4" notices or 
somesuch.  This will make life much easier for ports.

Doug


