Posted to general@lucene.apache.org by "K. M. McCormick" <ky...@gmail.com> on 2009/08/10 05:45:41 UTC

Terms-Across-All-Documents

Hello There:

I am currently working on an INDEX STAT GENERATOR I'd like to use for some
term-weight tests in a (rather large) Lucene Index. In general, the stats
I'm hoping to work with are based on a term's frequency across the entire
indexed document set.

TF-IDF works out of the box in Lucene's searcher, and a Term's DF (across
all documents, obviously) is easy to get. However, TF in Lucene seems to be
available only on a per-document basis. That means that to compute the
number of times a term appears in the whole indexed document set, I would
have to (hypothetically) do the following:

- Given Term t, find TF(t)
- Get the enumeration of t over the index - TermDocs (so I have doc, freq
pairings)
- For each (doc, freq) pair, add freq to the total-index-frequency

So if I have x terms, I would be iterating over roughly x*DF(t) (doc, freq)
pairs across the entire index to find the index-frequency for all terms. Is
this the only way to get this information?
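
The loop described above can be sketched in plain Java. This is only an illustration under stated assumptions: the in-memory postings map and the class name PostingsSum are hypothetical stand-ins, not Lucene API. Against a real Lucene 2.x index the same loop would presumably walk the TermDocs returned by IndexReader.termDocs(term), calling next() and adding freq() each time.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for an index: term -> postings, each posting a {docId, freq} pair.
class PostingsSum {

    /** Sums the per-document frequencies of one term: one step per document containing it. */
    static long collectionFrequency(Map<String, List<int[]>> postings, String term) {
        long total = 0;
        for (int[] posting : postings.getOrDefault(term, List.of())) {
            total += posting[1]; // posting[0] is the doc id, posting[1] the in-document freq
        }
        return total;
    }

    /** One pass over every postings list yields the totals for all terms. */
    static Map<String, Long> allCollectionFrequencies(Map<String, List<int[]>> postings) {
        Map<String, Long> totals = new LinkedHashMap<>();
        for (String term : postings.keySet()) {
            totals.put(term, collectionFrequency(postings, term));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Toy data: "lucene" occurs 3 times in doc 0 and once in doc 2.
        Map<String, List<int[]>> postings = new LinkedHashMap<>();
        postings.put("lucene", List.of(new int[] {0, 3}, new int[] {2, 1}));
        postings.put("index", List.of(new int[] {1, 2}));
        System.out.println(allCollectionFrequencies(postings));
    }
}
```

The cost is one step per (doc, freq) pair, so the full pass is proportional to the total number of postings in the index.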

Since my data set (and term set) are quite large, I was trying to find out
whether Lucene has another mechanism for this, either at the indexing or
the searching level. However, I've had little luck sifting through the
information I've found (it mostly points me to TFIDF) to learn whether
Lucene has something I can use to make this process faster.

I have also read a bit about TermVectors, but those seem by-document as
well.

If there isn't a method at the search level (or,
after-index-complete-level), I would be willing to accept the overhead of
generating these stats at indexing time, if that would be more efficient...

Thanks,
drago

Re: Terms-Across-All-Documents

Posted by Ted Dunning <te...@gmail.com>.

I think that a more reasonable approach for experiments like this is to
store statistics of the sort that you want as part of the indexing process.
That will give you complete flexibility to do what you need.

Then at retrieval time you can look up term-level information and pass it
into a custom similarity function. That leaves you with a good barrier
between index-time and search-time, but still gives you any information that
you might like to use. Having your data in a side file helps you avoid
having to deal with those aspects of Lucene that are highly oriented around
efficiency, which is good in its place but could make your research work
much more difficult in the exploratory phase.
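
As a rough sketch of the side-file idea, the class below counts every token while documents are being indexed and writes the totals to a tab-separated file that can be read back at search time. Everything here is an assumption made for illustration: the class name SideFileTermStats, its methods, and the file layout are hypothetical and not part of Lucene, and the tokens are assumed to be the same ones handed to the indexer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: accumulates whole-collection term counts.
class SideFileTermStats {
    private final Map<String, Long> totals = new TreeMap<>();

    /** Call once per document with the same tokens that are handed to the indexer. */
    void addDocument(List<String> tokens) {
        for (String token : tokens) {
            totals.merge(token, 1L, Long::sum);
        }
    }

    /** Total occurrences of the term across all documents seen so far (0 if unseen). */
    long totalFrequency(String term) {
        return totals.getOrDefault(term, 0L);
    }

    /** Writes one "term<TAB>count" line per term (assumes terms contain no tabs or newlines). */
    void save(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            lines.add(e.getKey() + "\t" + e.getValue());
        }
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    /** Reads the side file back, for example when setting up the searcher. */
    static SideFileTermStats load(Path file) throws IOException {
        SideFileTermStats stats = new SideFileTermStats();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            int tab = line.lastIndexOf('\t');
            stats.totals.put(line.substring(0, tab), Long.parseLong(line.substring(tab + 1)));
        }
        return stats;
    }
}
```

At search time, totalFrequency(term) could then supply the term-level figure that the custom similarity function needs, without touching the postings at all.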

-- 
Ted Dunning, CTO
DeepDyve