Posted to java-user@lucene.apache.org by Stephen Thomas <st...@gmail.com> on 2011/11/27 23:42:35 UTC

Do duplicate documents affect term scoring?

List,

I am indexing a subset of Wikipedia. I have 4 years' worth of data, and
have taken snapshots of each document at each month in the 4-year
span. Thus, I have 4*12=36 versions of each document. (I keep track of
the timestamp in a field.) I have noticed that in many cases, a
Wikipedia document does not change very much between each version,
sometimes not at all. I end up with duplicate documents; the only
difference is the timestamp. Does this impact the term weighting used
by Lucene?

My intuition is that if a term only occurs in one document, but that
document occurs 36 times, then the frequency of the term is
"artificially" increased. Is this true? And if so, is this something I
need to worry about?

Thanks,
Steve

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Do duplicate documents affect term scoring?

Posted by Ian Lea <ia...@gmail.com>.
Lucene won't be aware that you've got duplicate documents, but scoring
does take account of the number of documents in which search terms
appear (the document frequency, which feeds the idf factor), and each
copy counts as a separate document there.  See
http://lucene.apache.org/java/3_5_0/scoring.html and the javadocs for
org.apache.lucene.search.Similarity.
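
A rough sketch of the effect, assuming the idf arithmetic used by
DefaultSimilarity in the 3.x line, 1 + ln(numDocs / (docFreq + 1)).
The class name IdfSketch and the 1000-document corpus are invented for
illustration; nothing here calls Lucene itself.

```java
// Illustration only: IdfSketch is a made-up class, not part of Lucene.
final class IdfSketch {
    // Same arithmetic as DefaultSimilarity.idf in Lucene 3.x (assumed here).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        int distinctDocs = 1000; // invented corpus size
        int copies = 36;         // snapshots per document, as in the question

        // Term occurs in one document, no duplicates in the index.
        System.out.println("no duplicates:     " + idf(1, distinctDocs));
        // Every document is stored 36 times: docFreq and numDocs both scale.
        System.out.println("every doc x36:     " + idf(copies, distinctDocs * copies));
        // Only the document containing the term is stored 36 times.
        System.out.println("only this doc x36: " + idf(copies, distinctDocs - 1 + copies));
    }
}
```

When every document is copied the same number of times, docFreq and
numDocs grow by the same factor, so idf moves only modestly. When only
some documents are duplicated, their terms look more common than they
are and idf drops for them.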

Only you can say whether or not you need to worry about it.  If you
do, you could provide your own implementation of Similarity.  Or
change your indexing process to skip updates where only the timestamp
changes.
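
One way the second option might look, as a minimal sketch: remember a
digest of the last snapshot seen for each document and only index a
snapshot whose content differs. SnapshotFilter and shouldIndex are
hypothetical names, not Lucene API, and the sketch assumes snapshots
are fed in chronological order per document.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: decides whether a snapshot
// is worth indexing by comparing it with the previous one.
final class SnapshotFilter {
    // Digest of the most recently seen content, keyed by document id.
    private final Map<String, byte[]> lastDigest = new HashMap<>();

    /** Returns true if body differs from the previous snapshot of docId. */
    boolean shouldIndex(String docId, String body) {
        byte[] digest = sha256(body);
        byte[] previous = lastDigest.put(docId, digest);
        return previous == null || !Arrays.equals(previous, digest);
    }

    private static byte[] sha256(String text) {
        try {
            return MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this is unexpected.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The body passed in should exclude the timestamp field. Note that this
only compares against the immediately preceding snapshot, so a page
that is changed and later reverted would still be indexed again.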


--
Ian.

