You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by dan sutton <da...@gmail.com> on 2011/01/18 13:13:05 UTC

Large .frq file

Hi,

We're trying to create a large index via solr for trends and notice
that we have a large '.frq' file after doing the following:


make all text fields index="true", stored="false",
omitTermFreqAndPositions="true" omitNorms="true" termPositions="false"
termOffsets="false" termVectors="false"

We are using a variation on org.apache.lucene.analysis.cjk and notice
that the .frq is about 4 time larger than, for example, the
WhiteSpaceTokenizer.


Considering that with omitTermFreqAndPositions="true" for the text
fields I'd have thought this should be : "If omitTf were true it would
be this sequence of VInts instead:"
(http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)


Can anyone suggest how I can reduce the size of this file?


Many thanks,
Dan

Lucene Specification Version: 2.9.1
Solr Specification Version: 1.4.0.2010.09.10.17.10.36

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Large .frq file

Posted by dan sutton <da...@gmail.com>.
Hi Shai,

What I really wanted to do was reduce the frq file size

Oddly (when tokenizing 3 seperate fields) with the
WhitespaceTokenizer, more terms are produced than with the CJK
analyzer and the CJK frq filesize is much larger ... examples below:

with WhitespaceTokenizer:
89M 	        _0.tis
1.4M 	_0.tii
71 	        _0.fnm
5.8M  	_0.fdx
741K 	_0.fdt
20  	       segments.gen
293  	segments_2
119M  	_0.frq

with CJKTokenizer:
31M   	_0.tis
633K 	_0.tii
71  	        _0.fnm
5.8M 	_0.fdx
741K  	_0.fdt
20  	        segments.gen
293  	segments_2
166M  	_0.frq

Also I believe solr calls addDocument with payLoads turned off. I'm
not sure why the size is much larger.

Cheers,
Dan

On Tue, Jan 18, 2011 at 12:41 PM, Shai Erera <se...@gmail.com> wrote:
> If I understand correctly, you compare the size of the .frq when
> WhitespaceTokenizer is used, vs the CJK ones?
>
> I'd bet this is because WhitespaceTokenizer creates far less terms than the
> CJK one. Whitespace tokenizes the text by separating on whitespace, while
> CJK does sort of N-Gram tokenization, which usually leads to much more terms
> created. This affects the .frq file in that there are much more posting
> lists created, which are stored in the .frq file.
>
> See if the .tii and .tis files differ and if their difference is the same
> order of the .frq differences (e.g. if they are 2x larger w/ CJK, so .frq
> should be of the same order of difference), then I believe this is the
> reason.
>
> Shai
>
> On Tue, Jan 18, 2011 at 2:13 PM, dan sutton <da...@gmail.com> wrote:
>
>> Hi,
>>
>> We're trying to create a large index via solr for trends and notice
>> that we have a large '.frq' file after doing the following:
>>
>>
>> make all text fields index="true", stored="false",
>> omitTermFreqAndPositions="true" omitNorms="true" termPositions="false"
>> termOffsets="false" termVectors="false"
>>
>> We are using a variation on org.apache.lucene.analysis.cjk and notice
>> that the .frq is about 4 time larger than, for example, the
>> WhiteSpaceTokenizer.
>>
>>
>> Considering that with omitTermFreqAndPositions="true" for the text
>> fields I'd have thought this should be : "If omitTf were true it would
>> be this sequence of VInts instead:"
>> (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)
>>
>>
>> Can anyone suggest how I can reduce the size of this file?
>>
>>
>> Many thanks,
>> Dan
>>
>> Lucene Specification Version: 2.9.1
>> Solr Specification Version: 1.4.0.2010.09.10.17.10.36
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Large .frq file

Posted by Shai Erera <se...@gmail.com>.
If I understand correctly, you compare the size of the .frq when
WhitespaceTokenizer is used, vs the CJK ones?

I'd bet this is because WhitespaceTokenizer creates far less terms than the
CJK one. Whitespace tokenizes the text by separating on whitespace, while
CJK does sort of N-Gram tokenization, which usually leads to much more terms
created. This affects the .frq file in that there are much more posting
lists created, which are stored in the .frq file.

See if the .tii and .tis files differ and if their difference is the same
order of the .frq differences (e.g. if they are 2x larger w/ CJK, so .frq
should be of the same order of difference), then I believe this is the
reason.

Shai

On Tue, Jan 18, 2011 at 2:13 PM, dan sutton <da...@gmail.com> wrote:

> Hi,
>
> We're trying to create a large index via solr for trends and notice
> that we have a large '.frq' file after doing the following:
>
>
> make all text fields index="true", stored="false",
> omitTermFreqAndPositions="true" omitNorms="true" termPositions="false"
> termOffsets="false" termVectors="false"
>
> We are using a variation on org.apache.lucene.analysis.cjk and notice
> that the .frq is about 4 time larger than, for example, the
> WhiteSpaceTokenizer.
>
>
> Considering that with omitTermFreqAndPositions="true" for the text
> fields I'd have thought this should be : "If omitTf were true it would
> be this sequence of VInts instead:"
> (http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)
>
>
> Can anyone suggest how I can reduce the size of this file?
>
>
> Many thanks,
> Dan
>
> Lucene Specification Version: 2.9.1
> Solr Specification Version: 1.4.0.2010.09.10.17.10.36
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>