Posted to java-user@lucene.apache.org by ma...@yahoo.co.uk on 2004/02/29 23:54:32 UTC
More Like This Query updated plus benchmarks
I have updated the MoreLikeThis query generator to address a few issues.
The code is available here: http://home.clara.net/markharwood/lucene/MoreLikeThis.java
I have added comments at the top of the class to describe the changes.
I was interested in the benefits of the new TermVector code, so I benchmarked its effect
on the average time to generate a "MoreLikeThis" Query object for example docs of varying
size, using indexes with and without TermVector support:
For avg example doc size of 250 bytes:
    VectorIndex       21 ms
    NoVectorIndex     37 ms
For avg example doc size of 1,000 bytes:
    VectorIndex       25 ms
    NoVectorIndex     48 ms
For avg example doc size of 16,000 bytes:
    VectorIndex      235 ms
    NoVectorIndex    356 ms
For avg example doc size of 150,000 bytes:
    VectorIndex      533 ms
    NoVectorIndex   1809 ms
TermVector support is beneficial, and its effect is more noticeable on larger docs.
However, once you get into 200k-sized docs you probably want to look at ways to improve
performance.
A tokenizing size limit is an obvious way to optimise performance for large docs without
term vectors. This cuts down on tokenizing time but may reduce the quality of results.
I introduced a default 5,000-term limit on tokenization, and this cut the 1809 ms in the
above results down to 612 ms.
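The idea behind the limit can be sketched in plain Java (this is an illustration only, not the actual MoreLikeThis code; the class and method names here are made up):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class CappedTokenizer {
    // Count term frequencies for an example doc, but stop after maxTokens
    // tokens. Trades some result quality for much less tokenizing time
    // on very large docs that have no stored term vectors.
    static Map<String, Integer> termFreqs(String text, int maxTokens) {
        Map<String, Integer> freqs = new HashMap<>();
        StringTokenizer st = new StringTokenizer(text.toLowerCase());
        int seen = 0;
        while (st.hasMoreTokens() && seen < maxTokens) {
            freqs.merge(st.nextToken(), 1, Integer::sum);
            seen++;
        }
        return freqs;
    }

    public static void main(String[] args) {
        // With a cap of 3, only the first three tokens are counted.
        System.out.println(termFreqs("flowers essex flowers gift", 3));
    }
}
```

Since the most characteristic terms of a doc usually appear early and often, a front-truncation like this tends to preserve the top query terms, which matches the comparison below.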
I haven't been able to test the quality of the results produced by this query (my 150k docs
were made by concatenating several smaller docs of different subject matter).
Looking at the query terms produced, however, it seems to compare reasonably with the vector-produced one:
* 5k tokenize limit query=: colchest our essex home us we you from flower uk site click your ship compani new servic page 01206 fashion gift here music florist busi
* Full vector query=: colchest our essex you flower we us click home school from your suffolk florist site about here servic uk new deliveri gift page an 01206
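As a rough sanity check, the agreement between the two term lists above can be quantified as set overlap (a throwaway sketch, not part of MoreLikeThis):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TermOverlap {
    // Fraction of terms shared by two term sets, relative to the larger set.
    static double overlap(Set<String> a, Set<String> b) {
        Set<String> shared = new HashSet<>(a);
        shared.retainAll(b);
        return shared.size() / (double) Math.max(a.size(), b.size());
    }

    public static void main(String[] args) {
        // Term lists copied from the two queries above
        Set<String> capped = new HashSet<>(Arrays.asList(
            ("colchest our essex home us we you from flower uk site click your "
             + "ship compani new servic page 01206 fashion gift here music florist busi").split(" ")));
        Set<String> full = new HashSet<>(Arrays.asList(
            ("colchest our essex you flower we us click home school from your "
             + "suffolk florist site about here servic uk new deliveri gift page an 01206").split(" ")));
        System.out.println(overlap(capped, full)); // 20 of 25 terms shared: 0.8
    }
}
```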
I'm not currently sure what the approach would be to optimising performance for TermVector-backed queries
when using large example docs.
On a related subject: now that I understand the TermVector feature better (and have found
that it contains no position data), I can't see how it would be of any benefit in optimising
the highlighter code. I'd previously thought term-sequence information was in there.
Cheers
Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org
Re: More Like This Query updated plus benchmarks
Posted by Felix Huber <hu...@webtopia.de>.
> I have updated the MoreLikeThis query generator to address a few
> issues.
> The code is available here:
> http://home.clara.net/markharwood/lucene/MoreLikeThis.java
> I have added comments at the top of the class to describe the changes.
>
Perhaps it's time for a CVS check-in (sandbox)?
Regards,
Felix Huber
Re: Filtering out duplicate documents...
Posted by Erik Hatcher <er...@ehatchersolutions.com>.
My impression is that the new term vector support should at least make this
type of comparison feasible in some manner. I'd be interested to see
what you come up with if you give this a try. You will need the latest
CVS codebase.
Erik
On Mar 8, 2004, at 4:37 PM, Michael Giles wrote:
> I'm looking for a way to filter out duplicate documents from an index
> (either while indexing, or after the fact). It seems like there
> should be an approach based on comparing the terms of two documents,
> but I'm wondering whether other folks (e.g. Nutch) have come up with a
> solution to this problem.
>
> Obviously you can compute the Levenshtein distance on the text, but
> that is way too computationally intensive to scale. So the goal is to
> find something that would be workable in a production system. For
> example, a given NYT article and its printer-friendly version should
> be deemed to be the same.
>
> -Mike
Filtering out duplicate documents...
Posted by Michael Giles <mg...@visionstudio.com>.
I'm looking for a way to filter out duplicate documents from an index
(either while indexing, or after the fact). It seems like there should be
an approach based on comparing the terms of two documents, but I'm wondering
whether other folks (e.g. Nutch) have come up with a solution to this problem.
Obviously you can compute the Levenshtein distance on the text, but that is
way too computationally intensive to scale. So the goal is to find
something that would be workable in a production system. For example, a
given NYT article and its printer-friendly version should be deemed to be
the same.
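One cheap term-based comparison of the kind stored term vectors would make inexpensive is cosine similarity over term-frequency maps (a sketch of the general technique, not a Lucene API; near-duplicates score close to 1.0 and can be flagged with a threshold):

```java
import java.util.HashMap;
import java.util.Map;

public class NearDuplicate {
    // Cosine similarity of two documents represented as term-frequency maps.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer bv = b.get(e.getKey());
            if (bv != null) dot += e.getValue() * (double) bv;
            normA += e.getValue() * (double) e.getValue();
        }
        for (int v : b.values()) normB += v * (double) v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Naive whitespace tokenization into term frequencies.
    static Map<String, Integer> freqs(String text) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : text.toLowerCase().split("\\s+")) m.merge(t, 1, Integer::sum);
        return m;
    }

    public static void main(String[] args) {
        String article = "stocks fell sharply on wall street today";
        String printer = "stocks fell sharply on wall street today print";
        // The printer-friendly version adds only one term; similarity stays
        // above a 0.9 duplicate threshold.
        System.out.println(cosine(freqs(article), freqs(printer)) > 0.9); // prints true
    }
}
```

This is linear in document size per pair; for a whole index you would still want to avoid all-pairs comparison, e.g. by only comparing docs that share rare terms.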
-Mike