Posted to java-user@lucene.apache.org by Manuel LeNormand <ma...@gmail.com> on 2013/04/25 00:29:12 UTC

Too many unique terms

Hi there,
Looking at my index (about 1M docs) I see a lot of unique terms, more
than 8M, which is a significant part of my total term count. These are very
likely useless terms, binary strings or other meaningless numbers that come
with a few of my docs.
I am totally fine with deleting them so that these terms become unsearchable.
Thinking about it, I get that:
1. It is impossible to know a priori whether a term is unique or not, so I
cannot add them to my stop words.
2. I have a performance decrease because my cached "hot spot" chunks (4kb)
contain useless data. It's a problem for me as I'm short on memory.

Q:
Assuming a constant index, is there a way of deleting all terms that are
unique, at least from the dictionary (tim and tip) files? Do I need to go
into the source code for this, and if so, what part of it?
Will I get a significant query-time performance increase besides the better
RAM use benefit?
Are there any existing updateProcessor classes that identify non-human-readable
terms?

Thanks in advance,
Manu

Re: Too many unique terms

Posted by Adrien Grand <jp...@gmail.com>.
Hi,

On Mon, Apr 29, 2013 at 10:38 PM, Manuel Le Normand
<ma...@gmail.com> wrote:
> I want to make sure: iterating with the TermsEnum will not delete all the
> terms occurring in the same doc that contains the single term, but only the
> single term, right?
> Going through the class TermsEnum I cannot find any "delete" method; how can
> I do this?

Sorry, I was unclear. Deleting a term will delete all documents
that match this term.

--
Adrien



Re: Too many unique terms

Posted by Manuel Le Normand <ma...@gmail.com>.
On Mon, Apr 29, 2013 at 1:22 PM, Adrien Grand <jp...@gmail.com> wrote:

> On Sat, Apr 27, 2013 at 8:41 PM, Manuel Le Normand
> <ma...@gmail.com> wrote:
> > Hi, many thanks for the previous reply.
> > For now I'm not able to separate out these useless terms, whether they
> > consist of letters or digits.
> > I liked the idea of iterating with TermsEnum. Will it also delete the
> > occurrences of these terms in the other file formats (termVectors etc.)?
>
> Yes it will. But since Lucene only marks documents as deleted, you will
> need to force a merge in order to expunge the deletes.
>

I want to make sure: iterating with the TermsEnum will not delete all the
terms occurring in the same doc that contains the single term, but only the
single term, right?
Going through the class TermsEnum I cannot find any "delete" method; how can
I do this?


> > As I understand it, the strField implementation is a kind of TrieField
> > ordered by the leading char (as searches support wildcards), and every
> > term in the dictionary points to the inverted file (frq) to find the list
> > (not a bitmap) of the docs containing the term.
>
> These details are codec-specific, but they are correct for the current
> postings format. You can have a look at
>
> https://builds.apache.org/job/Lucene-Artifacts-trunk/javadoc/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Termdictionary
> for more information.
>
> > Let's say I query for the term "hello" many times within different
> > queries; the OS will load into memory the matching 4k chunk from the
> > dictionary and frq. If most of my terms are garbage, much of the
> > dictionary chunk will be useless, whereas the frq chunk will be used more
> > efficiently as it contains the whole <termFreq> list. Still, I'm not sure
> > a typical <termFreqs,skipData> chunk per term gets to 4k.
>
> Postings lists are compressed and most terms are usually present in
> only a few documents so most postings lists are likely much smaller
> than 4kb.
>
I actually get far smaller entries. Assuming linearity, I get only about 30
bytes for each term in the *.tim files and an average of 5 bytes per doc
freq (i.e. per occurrence across all terms), which is surprisingly efficient
and low. Anyway, that's not in the order of magnitude of 4k, so I will not
attempt to tune this. Calculations (and assumptions) show that omitting all
the unique terms would reduce the *.tim file by 80-90%, but since these terms
account for only about 10% of the words, they would give roughly that amount
of reduction in the pos and frq files. I guess the trie tree would become a
bit more efficient, but I don't reckon it's worth it.

If someone else has ever tuned this parameter, I'd love to know.


> > If my assumption's right, I should lower the memory chunk size (through
> > the OS) to about the 90th percentile of the <termFreq,skipData> chunk
> > size for a single term in the frq (neglecting for instance the use of prx
> > and termVectors). Any cons to the idea? Do you have any estimate of the
> > magnitude of a frq chunk for a term occurring N times, or how can I check
> > it on my own?
>
> I've never tuned this myself. I guess the main issue is that it
> could increase bookkeeping (to keep track of the pages) and thus CPU
> usage.
>
> Unfortunately the size of the postings lists is hard to predict
> because it depends on the data. They compress better when they are
> large and evenly distributed across all doc IDs. You could try to
> compare the sum of your doc freqs with the total byte size of the
> postings list to get a rough estimate.
>
> --
> Adrien
>

Re: Too many unique terms

Posted by Adrien Grand <jp...@gmail.com>.
On Sat, Apr 27, 2013 at 8:41 PM, Manuel Le Normand
<ma...@gmail.com> wrote:
> Hi, many thanks for the previous reply.
> For now I'm not able to separate out these useless terms, whether they
> consist of letters or digits.
> I liked the idea of iterating with TermsEnum. Will it also delete the
> occurrences of these terms in the other file formats (termVectors etc.)?

Yes it will. But since Lucene only marks documents as deleted, you will
need to force a merge in order to expunge the deletes.

> As I understand it, the strField implementation is a kind of TrieField
> ordered by the leading char (as searches support wildcards), and every term
> in the dictionary points to the inverted file (frq) to find the list (not a
> bitmap) of the docs containing the term.

These details are codec-specific, but they are correct for the current
postings format. You can have a look at
https://builds.apache.org/job/Lucene-Artifacts-trunk/javadoc/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Termdictionary
for more information.

> Let's say I query for the term "hello" many times within different queries;
> the OS will load into memory the matching 4k chunk from the dictionary and
> frq. If most of my terms are garbage, much of the dictionary chunk will be
> useless, whereas the frq chunk will be used more efficiently as it contains
> the whole <termFreq> list. Still, I'm not sure a typical <termFreqs,skipData>
> chunk per term gets to 4k.

Postings lists are compressed and most terms are usually present in
only a few documents so most postings lists are likely much smaller
than 4kb.

> If my assumption's right, I should lower the memory chunk size (through
> the OS) to about the 90th percentile of the <termFreq,skipData> chunk size
> for a single term in the frq (neglecting for instance the use of prx and
> termVectors). Any cons to the idea? Do you have any estimate of the
> magnitude of a frq chunk for a term occurring N times, or how can I check
> it on my own?

I've never tuned this myself. I guess the main issue is that it
could increase bookkeeping (to keep track of the pages) and thus CPU
usage.

Unfortunately the size of the postings lists is hard to predict
because it depends on the data. They compress better when they are
large and evenly distributed across all doc IDs. You could try to
compare the sum of your doc freqs with the total byte size of the
postings list to get a rough estimate.
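
For example, something like the following untested sketch could give a rough
bytes-per-posting figure. It uses the Lucene 4.x-era APIs; the index path and
field name are placeholders, and it assumes non-compound segment files so the
postings files are visible on disk.

import java.io.File;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class PostingsSizeEstimate {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/path/to/index")); // placeholder path
    DirectoryReader reader = DirectoryReader.open(dir);

    // Sum the doc freqs over all terms of one field.
    long sumDocFreq = 0;
    Terms terms = MultiFields.getTerms(reader, "text"); // placeholder field name
    if (terms != null) {
      TermsEnum te = terms.iterator(null);
      for (BytesRef t = te.next(); t != null; t = te.next()) {
        sumDocFreq += te.docFreq();
      }
    }

    // Total on-disk size of the postings files (.doc in Lucene41, .frq before).
    long postingsBytes = 0;
    for (String file : dir.listAll()) {
      if (file.endsWith(".doc") || file.endsWith(".frq")) {
        postingsBytes += dir.fileLength(file);
      }
    }

    System.out.println(postingsBytes + " bytes for " + sumDocFreq
        + " postings, ~" + ((double) postingsBytes / Math.max(1, sumDocFreq))
        + " bytes per posting");
    reader.close();
  }
}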

-- 
Adrien



Re: Too many unique terms

Posted by Manuel Le Normand <ma...@gmail.com>.
Hi, many thanks for the previous reply.
For now I'm not able to separate out these useless terms, whether they
consist of letters or digits.
I liked the idea of iterating with TermsEnum. Will it also delete the
occurrences of these terms in the other file formats (termVectors etc.)?

As I understand it, the strField implementation is a kind of TrieField
ordered by the leading char (as searches support wildcards), and every term
in the dictionary points to the inverted file (frq) to find the list (not a
bitmap) of the docs containing the term.

Let's say I query for the term "hello" many times within different queries;
the OS will load into memory the matching 4k chunk from the dictionary and
frq. If most of my terms are garbage, much of the dictionary chunk will be
useless, whereas the frq chunk will be used more efficiently as it contains
the whole <termFreq> list. Still, I'm not sure a typical <termFreqs,skipData>
chunk per term gets to 4k.

If my assumption's right, I should lower the memory chunk size (through
the OS) to about the 90th percentile of the <termFreq,skipData> chunk size
for a single term in the frq (neglecting for instance the use of prx and
termVectors). Any cons to the idea? Do you have any estimate of the
magnitude of a frq chunk for a term occurring N times, or how can I check
it on my own?

Thanks,
Manu


On Thu, Apr 25, 2013 at 2:04 AM, Adrien Grand <jp...@gmail.com> wrote:

> Hi Manuel,
>
> On Thu, Apr 25, 2013 at 12:29 AM, Manuel LeNormand
> <ma...@gmail.com> wrote:
> > Hi there,
> > Looking at my index (about 1M docs) I see a lot of unique terms, more
> > than 8M, which is a significant part of my total term count. These are
> > very likely useless terms, binary strings or other meaningless numbers
> > that come with a few of my docs.
>
> If you are only interested in letters, one option is to change your
> analysis chain to use LetterTokenizer. This tokenizer will split on
> everything that is not a letter, filtering out numbers and binary
> data.
>
> > I am totally fine with deleting them so that these terms become
> > unsearchable.
> > Thinking about it, I get that:
> > 1. It is impossible to know a priori whether a term is unique or not, so I
> > cannot add them to my stop words.
> > 2. I have a performance decrease because my cached "hot spot" chunks (4kb)
> > contain useless data. It's a problem for me as I'm short on memory.
> >
> > Q:
> > Assuming a constant index, is there a way of deleting all terms that are
> > unique, at least from the dictionary (tim and tip) files? Do I need to go
> > into the source code for this, and if so, what part of it?
>
> If frequencies are indexed, you can pull a TermsEnum, iterate through
> the terms dictionary and delete terms that are less frequent than a
> given threshold. As you said, this will however prevent your users
> from searching for these terms anymore.
>
> > Will I get a significant query-time performance increase besides the better
> > RAM use benefit?
>
> This is hard to answer. Having fewer terms in the terms dictionary
> should make search a little faster but I can't tell you by how much.
> You should also try to disable features that you don't use. For
> example, if you don't need positional information or frequencies,
> IndexOptions.DOCS_ONLY will make your postings lists smaller.
>
> --
> Adrien
>

Re: Too many unique terms

Posted by Adrien Grand <jp...@gmail.com>.
Hi Manuel,

On Thu, Apr 25, 2013 at 12:29 AM, Manuel LeNormand
<ma...@gmail.com> wrote:
> Hi there,
> Looking at my index (about 1M docs) I see a lot of unique terms, more
> than 8M, which is a significant part of my total term count. These are very
> likely useless terms, binary strings or other meaningless numbers that come
> with a few of my docs.

If you are only interested in letters, one option is to change your
analysis chain to use LetterTokenizer. This tokenizer will split on
everything that is not a letter, filtering out numbers and binary
data.
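
Just to illustrate (a rough, untested sketch against the Lucene 4.x analysis
API; the LowerCaseFilter is an extra assumption, not something LetterTokenizer
requires):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.util.Version;

// Tokenizes on letters only: digits and binary garbage never become terms.
public class LettersOnlyAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    LetterTokenizer source = new LetterTokenizer(Version.LUCENE_42, reader);
    return new TokenStreamComponents(source,
        new LowerCaseFilter(Version.LUCENE_42, source));
  }
}

Any field analyzed this way simply never produces numeric or binary tokens, so
they never reach the terms dictionary in the first place.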

> I am totally fine with deleting them so that these terms become unsearchable.
> Thinking about it, I get that:
> 1. It is impossible to know a priori whether a term is unique or not, so I
> cannot add them to my stop words.
> 2. I have a performance decrease because my cached "hot spot" chunks (4kb)
> contain useless data. It's a problem for me as I'm short on memory.
>
> Q:
> Assuming a constant index, is there a way of deleting all terms that are
> unique, at least from the dictionary (tim and tip) files? Do I need to go
> into the source code for this, and if so, what part of it?

If frequencies are indexed, you can pull a TermsEnum, iterate through
the terms dictionary and delete terms that are less frequent than a
given threshold. As you said, this will however prevent your users
from searching for these terms anymore.
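
Something along these lines might work (a rough, untested sketch against the
Lucene 4.x APIs of the time; the index path, field name and docFreq threshold
are placeholders, and note that deleting by term deletes the matching
documents, not just the term):

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class DeleteRareTerms {
  public static void main(String[] args) throws Exception {
    Directory dir = FSDirectory.open(new File("/path/to/index")); // placeholder path
    String field = "text";                                        // placeholder field name
    List<Term> rare = new ArrayList<Term>();

    // Collect terms whose document frequency is below the threshold.
    DirectoryReader reader = DirectoryReader.open(dir);
    try {
      Terms terms = MultiFields.getTerms(reader, field);
      if (terms != null) {
        TermsEnum te = terms.iterator(null);
        for (BytesRef term = te.next(); term != null; term = te.next()) {
          if (te.docFreq() <= 1) { // "unique" terms: appear in a single doc
            rare.add(new Term(field, BytesRef.deepCopyOf(term)));
          }
        }
      }
    } finally {
      reader.close();
    }

    // Deleting by term removes the matching documents, not just the term.
    // The analyzer is irrelevant for deletions; use whatever matches your setup.
    IndexWriterConfig cfg = new IndexWriterConfig(Version.LUCENE_42,
        new StandardAnalyzer(Version.LUCENE_42));
    IndexWriter writer = new IndexWriter(dir, cfg);
    try {
      writer.deleteDocuments(rare.toArray(new Term[rare.size()]));
      writer.forceMergeDeletes(); // expunge the deletes from the segments
    } finally {
      writer.close();
    }
  }
}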

>  Will I get a significant query-time performance increase besides the better
> RAM use benefit?

This is hard to answer. Having fewer terms in the terms dictionary
should make search a little faster but I can't tell you by how much.
You should also try to disable features that you don't use. For
example, if you don't need positional information or frequencies,
IndexOptions.DOCS_ONLY will make your postings lists smaller.
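
For illustration, with the 4.x APIs a DOCS_ONLY field could be set up roughly
like this (the field name and the stored/not-stored choice are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.FieldInfo.IndexOptions;

public class DocsOnlyFieldExample {
  public static Document makeDoc(String text) {
    // Start from a plain indexed text field and drop freqs/positions.
    FieldType ft = new FieldType(TextField.TYPE_NOT_STORED);
    ft.setIndexOptions(IndexOptions.DOCS_ONLY); // doc IDs only, smaller postings
    ft.freeze();

    Document doc = new Document();
    doc.add(new Field("text", text, ft)); // "text" is a placeholder field name
    return doc;
  }
}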

-- 
Adrien
