Posted to java-user@lucene.apache.org by Alex vB <ma...@avomberg.de> on 2011/04/22 03:52:05 UTC

New codecs keep Freq skip/omit Pos

Hello everybody,

I am currently testing several of the new Lucene 4.0 codec implementations to
compare them with my own solution. The difference is that my solution indexes
only frequencies and not positions, and I would like the other codecs to do the
same. I know there was already a post on this topic:
http://lucene.472066.n3.nabble.com/Omit-positions-but-not-TF-td599710.html

I just wanted to ask if anything has changed in this respect for the new
codecs. I had a look at FixedPostingsWriterImpl and PostingsConsumer. Are those
the right places to adapt the Pos/Freq handling? What would happen if I just
skipped writing positions/payloads? Would it mess up the index?

The written files have different extensions (.pyl, .skp, .pos, .doc, etc.).
Does leaving the .pos file out of the total give me a correct index size
estimate for W Freq W/O Pos, or where exactly are term positions written?

Regards
Alex

PS: Some results with the current codecs, in case anyone is interested. I
indexed 10% of the English Wikipedia; each version is indexed as its own
document.

Docs	240179
Versions	8467927
Distinct Terms	3501214
total Terms	1520008204
Avg. Versions	35.25
Avg. Terms per Version	179.50
Avg. Terms per Doc	6328.65

PforDelta W Freq W Pos	       20.6 GB
PforDelta W/O Freq W/O Pos	         1.6 GB
Standard 4.0 W Freq W Pos	       28.1 GB
Standard 4.0 W/O Freq W/O Pos	 6.2 GB
Pfor W Freq W Pos	                  22 GB
Pfor W/O Freq W/O Pos	         3.1 GB

Performance follows ;)




Re: New codecs keep Freq skip/omit Pos

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Apr 22, 2011 at 12:03 PM, Alex vB <ma...@avomberg.de> wrote:
> During indexing I use StandardAnalyzer (StandardFilter, LowerCaseFilter,
> StopFilter).
> Is there somewhere I can get more information on codec creation, or is it
> just a matter of digging through the code?

Try the following patch to switch PFOR1 and PFOR2 over to Sep, so that they
create separate .doc and .frq files. Then you can compare the compression of
the freqs against your implementation (again, the .skp/.tib/.tiv will be larger
due to using the Sep codec and having pos pointers, but try to ignore that).


Index: lucene/src/java/org/apache/lucene/index/codecs/pfordelta/PatchedFrameOfRefCodec.java
===================================================================
--- lucene/src/java/org/apache/lucene/index/codecs/pfordelta/PatchedFrameOfRefCodec.java	(revision 1095422)
+++ lucene/src/java/org/apache/lucene/index/codecs/pfordelta/PatchedFrameOfRefCodec.java	(working copy)
@@ -30,6 +30,8 @@
 import org.apache.lucene.index.codecs.FieldsProducer;
 import org.apache.lucene.index.codecs.fixed.FixedPostingsReaderImpl;
 import org.apache.lucene.index.codecs.fixed.FixedPostingsWriterImpl;
+import org.apache.lucene.index.codecs.sep.SepPostingsReaderImpl;
+import org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl;
 import org.apache.lucene.index.codecs.standard.StandardCodec;
 import org.apache.lucene.index.codecs.BlockTermsWriter;
 import org.apache.lucene.index.codecs.BlockTermsReader;
@@ -48,7 +50,7 @@

   @Override
   public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
-    PostingsWriterBase postingsWriter = new FixedPostingsWriterImpl(state, new PForDeltaFactory(128));
+    PostingsWriterBase postingsWriter = new SepPostingsWriterImpl(state, new PForDeltaFactory(128));

     boolean success = false;
     TermsIndexWriterBase indexWriter;
@@ -79,7 +81,7 @@

   @Override
   public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
-    PostingsReaderBase postingsReader = new FixedPostingsReaderImpl(state.dir,
+    PostingsReaderBase postingsReader = new SepPostingsReaderImpl(state.dir,
                                                                     state.segmentInfo,
                                                                     state.readBufferSize,
                                                                     new PForDeltaFactory(128),
@@ -125,14 +127,14 @@

   @Override
   public void files(Directory dir, SegmentInfo segmentInfo, String id, Set<String> files) {
-    FixedPostingsReaderImpl.files(segmentInfo, id, files);
+    SepPostingsReaderImpl.files(segmentInfo, id, files);
     BlockTermsReader.files(dir, segmentInfo, id, files);
     VariableGapTermsIndexReader.files(dir, segmentInfo, id, files);
   }

   @Override
   public void getExtensions(Set<String> extensions) {
-    FixedPostingsWriterImpl.getExtensions(extensions);
+    SepPostingsWriterImpl.getExtensions(extensions);
     BlockTermsReader.getExtensions(extensions);
     VariableGapTermsIndexReader.getIndexExtensions(extensions);
   }
Index: lucene/src/java/org/apache/lucene/index/codecs/pfordelta2/PForDeltaFixedIntBlockCodec.java
===================================================================
--- lucene/src/java/org/apache/lucene/index/codecs/pfordelta2/PForDeltaFixedIntBlockCodec.java	(revision 1095422)
+++ lucene/src/java/org/apache/lucene/index/codecs/pfordelta2/PForDeltaFixedIntBlockCodec.java	(working copy)
@@ -41,6 +41,8 @@
 import org.apache.lucene.index.codecs.VariableGapTermsIndexReader;
 import org.apache.lucene.index.codecs.VariableGapTermsIndexWriter;
 import org.apache.lucene.index.codecs.sep.IntStreamFactory;
+import org.apache.lucene.index.codecs.sep.SepPostingsReaderImpl;
+import org.apache.lucene.index.codecs.sep.SepPostingsWriterImpl;
 import org.apache.lucene.index.codecs.standard.StandardCodec;
 import org.apache.lucene.store.*;
 import org.apache.lucene.util.BytesRef;
@@ -168,7 +170,7 @@

   @Override
   public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
-    PostingsWriterBase postingsWriter = new FixedPostingsWriterImpl(state, new PForDeltaIntFactory());
+    PostingsWriterBase postingsWriter = new SepPostingsWriterImpl(state, new PForDeltaIntFactory());

     boolean success = false;
     TermsIndexWriterBase indexWriter;
@@ -199,7 +201,7 @@

   @Override
   public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
-    PostingsReaderBase postingsReader = new FixedPostingsReaderImpl(state.dir,
+    PostingsReaderBase postingsReader = new SepPostingsReaderImpl(state.dir,
                                                                     state.segmentInfo,
                                                                     state.readBufferSize,
                                                                     new PForDeltaIntFactory(), state.codecId);
@@ -244,14 +246,14 @@

   @Override
   public void files(Directory dir, SegmentInfo segmentInfo, String codecId, Set<String> files) {
-    FixedPostingsReaderImpl.files(segmentInfo, codecId, files);
+    SepPostingsReaderImpl.files(segmentInfo, codecId, files);
     BlockTermsReader.files(dir, segmentInfo, codecId, files);
     VariableGapTermsIndexReader.files(dir, segmentInfo, codecId, files);
   }

   @Override
   public void getExtensions(Set<String> extensions) {
-    FixedPostingsWriterImpl.getExtensions(extensions);
+    SepPostingsWriterImpl.getExtensions(extensions);
     BlockTermsReader.getExtensions(extensions);
     VariableGapTermsIndexReader.getIndexExtensions(extensions);
   }



Re: New codecs keep Freq skip/omit Pos

Posted by Alex vB <ma...@avomberg.de>.
> it depends upon the type of query.. what queries are you using for
> this benchmarking and how are you benchmarking?
> FYI: for benchmarking standard query types with wikipedia you might be
> interested in http://code.google.com/a/apache-extras.org/p/luceneutil/

I have 10000 queries from an AOL data set where the followed link led to
Wikipedia. I benchmark by warming up the IndexSearcher with 5000 of them and
then running the test with the remaining 5000 queries, measuring only the time
needed to execute the queries. I use QueryParser.
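
Roughly, the timing loop looks like this (a simplified sketch using the
3.x-style API; the field name, hit count and Version constant are placeholders,
not my exact code):

import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.Version;

public class QueryBench {
  // Warm up with the first batch of queries, then time the second batch.
  static double avgQueryTimeMs(Directory dir, List<String> warmup, List<String> timed) throws Exception {
    IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
    QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "body",
                                         new StandardAnalyzer(Version.LUCENE_CURRENT));
    for (String q : warmup) {
      searcher.search(parser.parse(q), 10);     // results discarded, warm-up only
    }
    long start = System.nanoTime();
    for (String q : timed) {
      searcher.search(parser.parse(q), 10);
    }
    return (System.nanoTime() - start) / 1e6 / timed.size();
  }
}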

> wait, you are indexing payloads for your tests with these other codecs
> when it says "W POS" ?

No, only my latest implementation uses payloads; none of the others do. That
is why I use a payload-aware query for Huffman.

> keep in mind that even adding a single payload to your index slows
> down the decompression of the positions tremendously, because payload
> lengths are intertwined with the positions. For block codecs payloads
> really need to be done differently so that blocks of positions are
> really just blocks of positions. This hasn't yet been fixed for the
> sep nor the fixed layouts, so if you add any payloads, and then
> benchmark positional queries then the results are not realistic.

Oh, I know that payloads slow down query processing, but I wasn't aware of the
block codec problem. I assume that by "not realistic" you mean they will be
slower? Some numbers for Huffman:
20 Bytes segments.gen
234.6 KB fdt
1.8 MB fdx
20 bytes fnm
626.1 MB pos
1.7 GB pyl
17.8 MB skp
39.8 MB tib
2028.5 KB tiv
268 Bytes segments_2
214.6 MB doc

For query processing I used the PayloadQueryParser here and adapted the
similarity according to my payloads.

> No they do not, only if you use a payload based query such as
> PayloadTermQuery. Normal non-positional queries like TermQuery and
> even normal positional queries like PhraseQuery don't fetch payloads
> at all...

Sorry, my question was misleading. I am already focusing on a payload-aware
query. When I use one, how exactly is the payload information fetched from
disk? For example, if a query needs to read two posting lists, are all
payloads for them fetched directly, or does Lucene first compute the boolean
intersection and then retrieve the payloads only for documents within that
intersection?

> From the description of what you are doing I don't understand how
> payloads fit in because they are per-position? But, I haven't had the
> time to digest the paper you sent yet.

I will try to summarize it and explain how I adapted it to Lucene.

I already mentioned the idea of two levels for versioned document collections.
When I parse Wikipedia I unite, for each article, the terms of all its
versions. From this bag of words I extract each distinct term and index it
with Lucene into one document. Frequency information is now "lost" on the
first level but is stored on the second. This is what I meant by "the first
level contains a posting for a document when a term occurs in at least one
version". For example, if an article has two versions, version1: "a b b" and
version2: "a a a c c", only 'a', 'b' and 'c' are indexed.

For the second level I collect term frequency information during the parsing
step. Those frequencies are stored as a vector in version order; for the
example above, the frequency vector for 'a' would be [1,3]. I store these
vectors as payloads, which I see as the "second level": every distinct term on
the first level receives a single frequency vector on its first position. So I
somewhat abuse payloads.
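
To make that concrete, the packing could look roughly like this (a hypothetical
fixed-width layout purely for illustration; my real encoding is compressed):

// Hypothetical fixed-width packing of the per-version frequency vector into a
// single payload (4 bytes per version, big-endian).
static byte[] encodeFreqVector(int[] freqsPerVersion) {
  byte[] payload = new byte[freqsPerVersion.length * 4];
  for (int i = 0; i < freqsPerVersion.length; i++) {
    int f = freqsPerVersion[i];
    payload[4 * i]     = (byte) (f >>> 24);
    payload[4 * i + 1] = (byte) (f >>> 16);
    payload[4 * i + 2] = (byte) (f >>> 8);
    payload[4 * i + 3] = (byte) f;
  }
  return payload;  // attached as the payload of the term's first position
}

For the example above, encodeFreqVector(new int[]{1, 3}) would be the payload
stored for 'a'.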

For query processing I now need to retrieve the docs and payloads. It would be
optimal to process the posting lists first, ignoring payloads, and then fetch
the payloads (frequency information) only for the remaining docs. The term
frequency is then used for ranking purposes. At the moment I pick, for ranking,
the highest value from the freq vector, which corresponds to the best-matching
version.
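
The ranking step then boils down to something like this (again just a sketch
matching the hypothetical layout above; in my code this happens inside the
payload-aware scoring):

// Decode the payload written above and use the highest per-version frequency
// (the best-matching version) as the term's frequency for ranking.
static int maxFreq(byte[] payload, int offset, int length) {
  int max = 0;
  for (int i = offset; i + 4 <= offset + length; i += 4) {
    int f = ((payload[i] & 0xFF) << 24) | ((payload[i + 1] & 0xFF) << 16)
          | ((payload[i + 2] & 0xFF) << 8) |  (payload[i + 3] & 0xFF);
    max = Math.max(max, f);
  }
  return max;
}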

Regards
Alex






Re: New codecs keep Freq skip/omit Pos

Posted by Robert Muir <rc...@gmail.com>.
On Sat, Apr 23, 2011 at 2:06 PM, Alex vB <ma...@avomberg.de> wrote:
>
> I am a little bit curious about the Lucene 3.0 performance results because
> the larger index seems to
> work faster?!? I already ran the test several times. Are my results
> realistic at all? I thought PForDelta/2 would outperform the standard index
> implementations in query processing.

it depends upon the type of query.. what queries are you using for
this benchmarking and how are you benchmarking?
FYI: for benchmarking standard query types with wikipedia you might be
interested in http://code.google.com/a/apache-extras.org/p/luceneutil/

> The last result is my own implementation. I am still trying to get it smaller
> because I think I can improve the compression further. For indexing I use
> PForDelta2 in combination with payloads; those are causing the higher
> runtimes. In memory it looks nice. The gap between my solution and PForDelta
> is already 700 MB; I would say that is an improvement. :D I will have another
> look at it once I have built an index with your adapted implementation.

wait, you are indexing payloads for your tests with these other codecs
when it says "W POS" ?

keep in mind that even adding a single payload to your index slows
down the decompression of the positions tremendously, because payload
lengths are intertwined with the positions. For block codecs payloads
really need to be done differently so that blocks of positions are
really just blocks of positions. This hasn't yet been fixed for the
sep nor the fixed layouts, so if you add any payloads, and then
benchmark positional queries then the results are not realistic.

> Normally all payloads corresponding to a query get fetched, right?

No they do not, only if you use a payload based query such as
PayloadTermQuery. Normal non-positional queries like TermQuery and
even normal positional queries like PhraseQuery don't fetch payloads
at all...

From the description of what you are doing I don't understand how
payloads fit in because they are per-position? But, I haven't had the
time to digest the paper you sent yet.



Re: New codecs keep Freq skip/omit Pos

Posted by Alex vB <ma...@avomberg.de>.
Hi Robert,


the adapted codec is running, but it seems to be incredibly slow. It will take
some time ;)
Here are some performance results:

Indexing scheme                 Index Size                   Avg. query time         Max. query time
PforDelta2 W Freq W Pos         20.6 GB (3.3 GB w/o .pos)    81.97 ms                1295 ms
PforDelta2 W/O Freq W/O Pos      1.6 GB                      63.33 ms                 766 ms
Standard 4.0 W Freq W Pos       28.1 GB (8.1 GB w/o .prx)    77.71 ms                 978 ms
Standard 4.0 W/O Freq W/O Pos    6.2 GB                      59.93 ms                 718 ms
Standard 3.0 W Freq W Pos       28.1 GB (8.1 GB w/o .prx)    71.41 ms                 978 ms
Standard 3.0 W/O Freq W/O Pos    6.2 GB                      72.72 ms                 845 ms
PforDelta W Freq W Pos          22 GB   (5 GB w/o .pos)      67.98 ms                 783 ms
PforDelta W/O Freq W/O Pos       3.1 GB                      56.08 ms                 596 ms
Huffman BL10 W Freq W/O Pos      2.6 GB                      216.29 ms (Mem 14 ms)   1338 ms

I am a little bit curious about the Lucene 3.0 performance results because
the larger index seems to
work faster?!? I already ran the test several times. Are my results
realistic at all? I thought PForDelta/2 would outperform the standard index
implementations in query processing. 


The last result is my own implementation. I am still trying to get it smaller
because I think I can improve the compression further. For indexing I use
PForDelta2 in combination with payloads; those are causing the higher runtimes.
In memory it looks nice. The gap between my solution and PForDelta is already
700 MB; I would say that is an improvement. :D I will have another look at it
once I have built an index with your adapted implementation.


I still have another question. The basic idea of my implementation is to
create a "two-level" index structure specialized for versioned document
collections. On the first level I create a posting list entry for a document
whenever a term occurs in one or more of its versions. The second level holds
the corresponding term frequency information. Is it possible to build such a
structure by creating a codec? For query processing it should filter per
boolean query on the first level and only fetch information from the second
level when the document is in the intersection of the first level. At the
moment I use payloads to "simulate" a two-level structure.
Normally all payloads corresponding to a query get fetched, right?
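
To make the intended processing order concrete, here is a self-contained sketch
(plain arrays stand in for the real posting lists, and all names are made up):

// Intersect two sorted first-level doc-ID lists and consult the second level
// (per-version frequency vectors) only for documents in the intersection.
static void intersectThenRank(int[] docsA, int[][] versionFreqsA,
                              int[] docsB, int[][] versionFreqsB) {
  int i = 0, j = 0;
  while (i < docsA.length && j < docsB.length) {
    if (docsA[i] < docsB[j]) {
      i++;
    } else if (docsA[i] > docsB[j]) {
      j++;
    } else {
      int fA = 0, fB = 0;                        // second level touched only here
      for (int f : versionFreqsA[i]) fA = Math.max(fA, f);
      for (int f : versionFreqsB[j]) fB = Math.max(fB, f);
      System.out.println("doc " + docsA[i] + ": maxFreq(A)=" + fA + " maxFreq(B)=" + fB);
      i++; j++;
    }
  }
}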


If this structure is possible, there are several more implementations with
promising results (Two-Level Diff/MSA in this paper:
http://cis.poly.edu/suel/papers/version.pdf).

Regards Alex




Re: New codecs keep Freq skip/omit Pos

Posted by Alex vB <ma...@avomberg.de>.
Wow, cool!

I will give that a try!

Thank you!!

Alex



Re: New codecs keep Freq skip/omit Pos

Posted by Robert Muir <rc...@gmail.com>.
On Fri, Apr 22, 2011 at 12:24 PM, Alex vB <ma...@avomberg.de> wrote:
> I also indexed one time with Lucene 3.0. Are those sizes really completely
> the same?
>
> Standard 4.0 W Freq W Pos       28.1 GB
> Standard 4.0 W/O Freq W/O Pos   6.2 GB
> Standard 3.0 W Freq W Pos       28.1 GB
> Standard 3.0 WO Freq WO Pos     6.2 GB
>

They shouldn't be *completely* the same, but for your test (where the terms
dictionary etc. is relatively small) they should be very close?

Standard 4.0 is still using the same underlying vByte compression etc. as the
3.0 index format, though it has some major changes in other places (e.g. the
terms dict).
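
For reference, the vByte idea in a nutshell (a standalone sketch, not the
actual writeVInt code): each byte holds 7 bits of the value and the high bit
says whether more bytes follow, so small delta-coded doc gaps cost a single
byte.

// Minimal vByte sketch: 7 value bits per byte, high bit = "more bytes follow".
// Returns the next free position in buf.
static int writeVByte(byte[] buf, int pos, int value) {
  while ((value & ~0x7F) != 0) {
    buf[pos++] = (byte) ((value & 0x7F) | 0x80);
    value >>>= 7;
  }
  buf[pos++] = (byte) value;
  return pos;
}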



Re: New codecs keep Freq skip/omit Pos

Posted by Alex vB <ma...@avomberg.de>.
I also indexed one time with Lucene 3.0. Are those sizes really completely
the same?

Standard 4.0 W Freq W Pos	28.1 GB
Standard 4.0 W/O Freq W/O Pos	6.2 GB
Standard 3.0 W Freq W Pos	28.1 GB
Standard 3.0 WO Freq WO Pos	6.2 GB

Regards
Alex




Re: New codecs keep Freq skip/omit Pos

Posted by Alex vB <ma...@avomberg.de>.
Hello Robert,

thank you for the answers! :)
I used PatchedFrameOfRef and PatchedFrameOfRef2, so both implementations are
actually PForDelta variants; sorry, my mistake. The mapping is:

PatchedFrameOfRef2: PforDelta W/O Freq W/O Pos    1.6 GB
PatchedFrameOfRef:  Pfor      W/O Freq W/O Pos    3.1 GB

Here are some numbers:
PatchedFrameOfRef2 w/o POS w/o FREQ
segments.gen  20 Bytes
_43.fdt  8,1 MB
_43.fdx  64,4 MB
_43.fnm  20 Bytes
_43_0.skp  182,6 MB
_43_0.tib  32,3 MB
_43_0.tiv  1,0 MB
segments_2  268 Bytes
_43_0.doc  1,3 GB

PatchedFrameOfRef w/o POS w/o FREQ
segments.gen  20 Bytes
_43.fdt  8,1 MB
_43.fdx  64,4 MB
_43.fnm  20 Bytes
_43_0.skp  182,6 MB
_43_0.tib  32,3 MB
_43_0.tiv  1,1 MB
segments_2  267 Bytes
_43_0.doc  2,8 GB

During indexing I use StandardAnalyzer (StandardFilter, LowerCaseFilter,
StopFilter). 
Is there somewhere I can get more information on codec creation, or is it just
a matter of digging through the code?

My own implementation needs 2.8 GB of space including FREQ but not POS. This
is why I am asking: I want to compare the results somehow. Compared to 20 GB it
is very nice, and compared to 1.6 GB it is very bad ;).

Regards
Alex




Re: New codecs keep Freq skip/omit Pos

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Apr 21, 2011 at 9:52 PM, Alex vB <ma...@avomberg.de> wrote:
>
> PforDelta W Freq W Pos         20.6 GB
> PforDelta W/O Freq W/O Pos               1.6 GB
> Standard 4.0 W Freq W Pos              28.1 GB
> Standard 4.0 W/O Freq W/O Pos    6.2 GB
> Pfor W Freq W Pos                         22 GB
> Pfor W/O Freq W/O Pos            3.1 GB
>

Hi, can you provide some more details on these index size numbers?
* Which one is PforDelta versus Pfor? We have two PFOR-delta impls,
PatchedFrameOfRef and PatchedFrameOfRef2, that are slightly different... I'm
pretty curious about the huge size differential between the two (e.g. 1.6 GB
versus 3.1 GB); can you give more info/a breakdown of the file sizes?
* Are you using a stopfilter at index time, or are you indexing all terms
including stopwords?



Re: New codecs keep Freq skip/omit Pos

Posted by Robert Muir <rc...@gmail.com>.
On Thu, Apr 21, 2011 at 9:52 PM, Alex vB <ma...@avomberg.de> wrote:
> Hello everybody,
>
> I am currently testing several of the new Lucene 4.0 codec implementations
> to compare them with my own solution. The difference is that my solution
> indexes only frequencies and not positions, and I would like the other
> codecs to do the same. I know there was already a post on this topic:
> http://lucene.472066.n3.nabble.com/Omit-positions-but-not-TF-td599710.html
>
> I just wanted to ask if anything has changed in this respect for the new
> codecs. I had a look at FixedPostingsWriterImpl and PostingsConsumer. Are
> those the right places to adapt the Pos/Freq handling? What would happen if
> I just skipped writing positions/payloads? Would it mess up the index?

It would, unless lots of things are changed in the code :)

All of the code here currently assumes omitTF means omitTFAP (freqs and
positions are omitted). So for this to work it would be good to have an omitP:
if omitTF=true then omitP is also set to true, but omitP can be true while
omitTF=false. Every place that currently checks if (omitTF) would need to be
evaluated to determine whether it should really be "if (omitP)" instead. For
example, when setting up the block readers for a bulk postings enum with
positions, we would set the positions block reader to null when omitP=true
instead of when omitTF=true.
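
In other words, roughly this kind of flag handling (the names are invented just
to make the rule concrete; this is not actual Lucene code):

class PostingsFlags {
  boolean omitTF;   // omit freqs (today this also implies omitting positions)
  boolean omitP;    // proposed: omit positions only, freqs still written

  void setOmitTF(boolean v) {
    omitTF = v;
    if (v) omitP = true;            // omitTF=true forces omitP=true
  }

  void setOmitP(boolean v) {
    omitP = v || omitTF;            // omitP may be true while omitTF stays false
  }

  boolean writeFreqs()     { return !omitTF; }
  boolean writePositions() { return !omitP; }  // use where "if (omitTF)" guarded positions
}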

The Fixed layout is very experimental and messy at the moment; it might be
easier to ignore it and start with Sep (when creating a FixedIntBlock codec you
can easily choose the layout: just use SepPostingsWriter etc. instead of
FixedPostingsWriter).

The reason I say it's probably easier is that Sep makes far fewer
assumptions/optimizations and would be easier to modify: it creates separate
.doc, .frq, and .pos files, which can all use different block sizes or even
different compression algorithms.

On the other hand, the whole point of the Fixed layout is to take advantage of
the fact that the block size is the same across doc, freq, and pos (it
interleaves doc and freq into .doc), and to work the postings somewhat in
"parallel" using skipBlock() when possible. So this one would be more difficult
at the moment due to its nature.

>
> The written files have different extensions (.pyl, .skp, .pos, .doc, etc.).
> Does leaving the .pos file out of the total give me a correct index size
> estimate for W Freq W/O Pos, or where exactly are term positions written?
>

Well, it's not just a matter of subtracting the .pos file: for example, there
are pointers to the .pos file in the terms dictionary, skip data for the .pos
file, etc., all of which will be smaller if there is no pos file (as they don't
need to exist)... but these are more things that would have to be modified to
support omitting positions without omitting frequencies, and I haven't even
thought about all the other places (I am sure there are many!).
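
By the way, if you just want the rough per-extension breakdowns like the ones
posted earlier in this thread, something like this is enough (plain java.io,
nothing Lucene-specific); just keep in mind it still over-counts a "no
positions" estimate for the reasons above:

import java.io.File;
import java.util.Map;
import java.util.TreeMap;

public class IndexSizeByExtension {
  // Sum index file sizes per extension (.doc, .frq, .pos, .skp, .tib, ...).
  static Map<String, Long> sizesByExtension(File indexDir) {
    Map<String, Long> sizes = new TreeMap<String, Long>();
    for (File f : indexDir.listFiles()) {
      String name = f.getName();
      int dot = name.lastIndexOf('.');
      String ext = dot < 0 ? name : name.substring(dot);
      Long prev = sizes.get(ext);
      sizes.put(ext, (prev == null ? 0L : prev) + f.length());
    }
    return sizes;
  }
}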
