Posted to java-user@lucene.apache.org by dan sutton <da...@gmail.com> on 2011/01/18 13:13:05 UTC
Large .frq file
Hi,
We're trying to create a large index via solr for trends and notice
that we have a large '.frq' file after doing the following:
make all text fields index="true", stored="false",
omitTermFreqAndPositions="true" omitNorms="true" termPositions="false"
termOffsets="false" termVectors="false"
We are using a variation on org.apache.lucene.analysis.cjk and notice
that the .frq is about 4 times larger than with, for example, the
WhitespaceTokenizer.
Since omitTermFreqAndPositions="true" is set for the text fields, I'd
have expected the .frq file to use the compact form described in the
file formats documentation: "If omitTf were true it would be this
sequence of VInts instead:"
(http://lucene.apache.org/java/2_9_1/fileformats.html#Frequencies)
Can anyone suggest how I can reduce the size of this file?
Many thanks,
Dan
Lucene Specification Version: 2.9.1
Solr Specification Version: 1.4.0.2010.09.10.17.10.36
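To make the quoted omitTf format concrete, here is a minimal sketch of how the VInt sequence for one term differs with and without term frequencies. It is not Lucene's code: the class and method names (FrqVIntSketch, termFreqs, encodedBytes) are hypothetical, and the sample postings (a term occurring once in document 7 and three times in document 11) are, I believe, the worked example used on the file formats page linked above.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene's implementation: models the .frq
// entries written for a single term.
final class FrqVIntSketch {

    // Writes a non-negative int as a VInt: 7 data bits per byte, with the
    // high bit set on every byte except the last.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Returns the VInt values for a term's postings. docIds must be
    // strictly increasing; freqs[i] is the term frequency in docIds[i].
    static List<Integer> termFreqs(int[] docIds, int[] freqs, boolean omitTf) {
        List<Integer> values = new ArrayList<>();
        int previousDoc = 0;
        for (int i = 0; i < docIds.length; i++) {
            int delta = docIds[i] - previousDoc;
            previousDoc = docIds[i];
            if (omitTf) {
                values.add(delta);           // doc delta only
            } else if (freqs[i] == 1) {
                values.add(delta * 2 + 1);   // odd value means freq == 1
            } else {
                values.add(delta * 2);       // even value, freq follows
                values.add(freqs[i]);
            }
        }
        return values;
    }

    // Number of bytes the values occupy once written as VInts.
    static int encodedBytes(List<Integer> values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            writeVInt(out, v);
        }
        return out.size();
    }

    public static void main(String[] args) {
        int[] docIds = {7, 11};
        int[] freqs = {1, 3};
        System.out.println(termFreqs(docIds, freqs, false)); // [15, 8, 3]
        System.out.println(termFreqs(docIds, freqs, true));  // [7, 4]
    }
}
```

Even with omitTf, every (term, document) pair still costs at least one byte, so under this model the .frq size follows the total number of postings rather than the number of distinct terms.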
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Large .frq file
Posted by dan sutton <da...@gmail.com>.
Hi Shai,
What I really wanted to do was reduce the frq file size.
Oddly (when tokenizing 3 separate fields), more terms are produced with
the WhitespaceTokenizer than with the CJK analyzer, yet the CJK frq
file size is much larger ... examples below:
with WhitespaceTokenizer:
89M _0.tis
1.4M _0.tii
71 _0.fnm
5.8M _0.fdx
741K _0.fdt
20 segments.gen
293 segments_2
119M _0.frq
with CJKTokenizer:
31M _0.tis
633K _0.tii
71 _0.fnm
5.8M _0.fdx
741K _0.fdt
20 segments.gen
293 segments_2
166M _0.frq
Also, I believe Solr calls addDocument with payloads turned off. I'm
not sure why the .frq file is so much larger.
Cheers,
Dan
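The comparison suggested in the quoted reply below can be done directly on the two listings above. This is only a sketch of the arithmetic: the class name SegmentSizeRatios is hypothetical, and the sizes are the rounded megabyte figures from the listings.

```java
// Hypothetical helper: compares how the term dictionary (.tis) and the
// postings file (.frq) changed between the two tokenizers.
final class SegmentSizeRatios {

    // Sizes in megabytes, copied from the listings above.
    static final double WHITESPACE_TIS = 89, WHITESPACE_FRQ = 119;
    static final double CJK_TIS = 31, CJK_FRQ = 166;

    // Size with the CJK tokenizer relative to the size with WhitespaceTokenizer.
    static double ratio(double cjk, double whitespace) {
        return cjk / whitespace;
    }

    public static void main(String[] args) {
        System.out.println("tis ratio: " + ratio(CJK_TIS, WHITESPACE_TIS));
        System.out.println("frq ratio: " + ratio(CJK_FRQ, WHITESPACE_FRQ));
    }
}
```

The ratios come out at roughly 0.35 for .tis and 1.39 for .frq, so the two files move in opposite directions: the term dictionary shrinks with CJK while the postings file grows.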
On Tue, Jan 18, 2011 at 12:41 PM, Shai Erera <se...@gmail.com> wrote:
> If I understand correctly, you are comparing the size of the .frq when
> WhitespaceTokenizer is used vs. the CJK one?
>
> I'd bet this is because WhitespaceTokenizer creates far fewer terms than the
> CJK one. Whitespace tokenizes the text by separating on whitespace, while
> CJK does a sort of N-gram tokenization, which usually leads to many more terms
> being created. This affects the .frq file in that many more posting
> lists are created, and these are stored in the .frq file.
>
> Check whether the .tii and .tis files differ and whether their difference is of
> the same order as the .frq difference (e.g. if they are 2x larger with CJK, the
> .frq should be larger by about the same factor). If so, I believe this is the
> reason.
>
> Shai
Re: Large .frq file
Posted by Shai Erera <se...@gmail.com>.
If I understand correctly, you are comparing the size of the .frq when
WhitespaceTokenizer is used vs. the CJK one?
I'd bet this is because WhitespaceTokenizer creates far fewer terms than the
CJK one. Whitespace tokenizes the text by separating on whitespace, while
CJK does a sort of N-gram tokenization, which usually leads to many more terms
being created. This affects the .frq file in that many more posting
lists are created, and these are stored in the .frq file.
Check whether the .tii and .tis files differ and whether their difference is of
the same order as the .frq difference (e.g. if they are 2x larger with CJK, the
.frq should be larger by about the same factor). If so, I believe this is the
reason.
Shai
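As a rough illustration of the difference described above, the sketch below splits the same text on whitespace and into overlapping two-character grams. It is a simplified stand-in and not Lucene's CJKTokenizer: the class and method names are hypothetical, it works on UTF-16 code units, and it ignores the special handling a real analyzer applies to Latin text and punctuation.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Hypothetical sketch contrasting whitespace tokens with overlapping bigrams.
final class TokenizationSketch {

    // Splits on runs of whitespace, dropping empty tokens.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Emits every overlapping pair of adjacent characters within each
    // whitespace-delimited run; a single-character run is kept as is.
    static List<String> overlappingBigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (String run : whitespaceTokens(text)) {
            if (run.length() == 1) {
                tokens.add(run);
            }
            for (int i = 0; i + 2 <= run.length(); i++) {
                tokens.add(run.substring(i, i + 2));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Two runs of 5 and 4 CJK characters, written as Unicode escapes.
        String text = "\u4eca\u65e5\u306f\u6674\u308c \u660e\u65e5\u306f\u96e8";
        List<String> words = whitespaceTokens(text);
        List<String> bigrams = overlappingBigrams(text);
        System.out.println("whitespace tokens: " + words.size());
        System.out.println("bigram tokens:     " + bigrams.size());
        System.out.println("distinct bigrams:  " + new HashSet<>(bigrams).size());
    }
}
```

For this sample the whitespace split yields 2 tokens while the bigram split yields 7, of which 6 are distinct because one bigram occurs in both runs. On a large corpus the same bigrams tend to recur across many documents, so the number of distinct terms may well go down while the number of postings goes up; whether that accounts for the file sizes reported in this thread is not established here.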