Posted to java-user@lucene.apache.org by ma...@yahoo.co.uk on 2004/02/29 23:54:32 UTC

More Like This Query updated plus benchmarks

I have updated the MoreLikeThis query generator to address a few issues.
The code is available here: http://home.clara.net/markharwood/lucene/MoreLikeThis.java
I have added comments at the top of the class to describe the changes.

I was interested in the benefits of the new TermVector code, so I benchmarked 
its effect on the average time to generate a "MoreLikeThis" Query object for example 
docs of varying sizes, from indexes with and without TermVector support:

For avg example doc size of 250 bytes:
  VectorIndex       21 ms
  NoVectorIndex     37 ms

For avg example doc size of 1,000 bytes:
  VectorIndex       25 ms
  NoVectorIndex     48 ms

For avg example doc size of 16,000 bytes:
  VectorIndex      235 ms
  NoVectorIndex    356 ms

For avg example doc size of 150,000 bytes:
  VectorIndex      533 ms
  NoVectorIndex   1809 ms


TermVector support is clearly beneficial, and its effect is more noticeable on larger docs.
However, once you get into 200k-sized docs you probably want to look at ways to improve 
performance.
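The speed-up makes sense when you look at where the term frequencies come from in each case. A rough sketch in plain Java (this is conceptual, not the Lucene API — the Map simply stands in for a stored term vector):

```java
import java.util.HashMap;
import java.util.Map;

public class TermFreqSource {

    // Without term vectors: re-analyze the stored text at query time.
    // Cost grows with document size, which matches the benchmark numbers above.
    static Map<String, Integer> fromRawText(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }

    // With term vectors: the index already stores term -> frequency per doc,
    // so generating the query skips tokenization entirely.
    static Map<String, Integer> fromStoredVector(Map<String, Integer> storedVector) {
        return storedVector;
    }
}
```

The gap widens with doc size because the first path does work proportional to the raw text length, while the second only walks the (much smaller) distinct-term list.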

A tokenizing size limit is an obvious way to optimise performance for large docs without term vectors.
This cuts down on tokenizing time but may reduce the quality of results.
I introduced a default limit of 5000 tokens, and this cut the 1809 ms in the above 
results down to 612 ms.
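The cap amounts to something like this (a plain-Java sketch, not the actual MoreLikeThis code; later Lucene versions expose a similar knob, named along the lines of setMaxNumTokensParsed):

```java
import java.util.HashMap;
import java.util.Map;

public class CappedTokenizer {
    // Default taken from the figure quoted above.
    static final int DEFAULT_MAX_TOKENS = 5000;

    // Stop tokenizing after maxTokens: bounds analysis cost for very large docs,
    // at the price of ignoring any terms that only appear past the cutoff.
    static Map<String, Integer> termFreqs(String text, int maxTokens) {
        Map<String, Integer> freqs = new HashMap<>();
        int seen = 0;
        for (String token : text.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            if (++seen > maxTokens) break;
            freqs.merge(token, 1, Integer::sum);
        }
        return freqs;
    }
}
```

The quality risk is exactly what you'd expect: a term that is highly characteristic of the doc but only occurs after token 5000 never reaches the query.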
I haven't been able to test the quality of results produced by this query (my 150k docs were made 
by concatenating several smaller docs of different subject matter).
Judging by the query terms produced, however, it compares reasonably with the vector-produced one:

* 5k tokenize limit query=: colchest our essex home us we you from flower uk site click your ship compani new servic page 01206 fashion gift here music florist busi 

* Full vector query=: colchest our essex you flower we us click home school from your suffolk florist site about here servic uk new deliveri gift page an 01206



I'm not currently sure what the approach would be to optimising performance for TermVector-backed queries
when using large example docs.


On a related subject: now that I understand the TermVector feature better (and have found that it 
stores no position data), I can't see any way it could help optimise the highlighter code.
I'd previously thought term sequence was in there.


Cheers
Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: More Like This Query updated plus benchmarks

Posted by Felix Huber <hu...@webtopia.de>.
> I have updated the MoreLikeThis query generator to address a few
> issues. 
> The code is available here:
> http://home.clara.net/markharwood/lucene/MoreLikeThis.java 
> I have added comments at the top of the class to describe the changes.
> 
Perhaps it's time for a CVS check-in (to the sandbox)?

Regards,
Felix Huber



Re: Filtering out duplicate documents...

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
My impression is the new term vector support should at least make this 
type of comparison feasible in some manner.  I'd be interested to see 
what you come up with if you give this a try.  You will need the latest 
CVS codebase.
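One cheap way to make that comparison concrete is cosine similarity over per-document term-frequency maps — the same data the term vector support stores. A self-contained sketch in plain Java (the whitespace tokenizer and the 0.9 cutoff are illustrative assumptions, not anything from the Lucene codebase):

```java
import java.util.HashMap;
import java.util.Map;

public class DuplicateCheck {

    // Build a term -> frequency map; with term vectors enabled this map
    // could be read straight from the index instead of re-tokenizing.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String t : text.toLowerCase().split("\\W+")) {
            if (!t.isEmpty()) freqs.merge(t, 1, Integer::sum);
        }
        return freqs;
    }

    // Cosine similarity of two term-frequency vectors: 1.0 = identical term mix.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer bf = b.get(e.getKey());
            if (bf != null) dot += (double) e.getValue() * bf;
        }
        for (int f : b.values()) normB += (double) f * f;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Arbitrary cutoff for illustration; an article and its printer-friendly
    // version share almost all terms, so they score near 1.0.
    static boolean looksLikeDuplicate(String a, String b) {
        return cosine(termFreqs(a), termFreqs(b)) > 0.9;
    }
}
```

This sidesteps the Levenshtein cost entirely: it is linear in the number of distinct terms per doc, though you'd still need some candidate-selection step to avoid comparing every pair in the index.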

	Erik


On Mar 8, 2004, at 4:37 PM, Michael Giles wrote:

> I'm looking for a way to filter out duplicate documents from an index 
> (either while indexing, or after the fact).  It seems like there 
> should be an approach of comparing the terms for two documents, but 
> I'm wondering if any other folks (i.e. nutch) have come up with a 
> solution to this problem.
>
> Obviously you can compute the Levenshtein distance on the text, but 
> that is way too computationally intensive to scale.  So the goal is to 
> find something that would be workable in a production system.  For 
> example, a given NYT article, and its printer friendly version should 
> be deemed to be the same.
>
> -Mike




Filtering out duplicate documents...

Posted by Michael Giles <mg...@visionstudio.com>.
I'm looking for a way to filter out duplicate documents from an index 
(either while indexing, or after the fact).  It seems like there should be 
an approach of comparing the terms for two documents, but I'm wondering if 
any other folks (i.e. nutch) have come up with a solution to this problem.

Obviously you can compute the Levenshtein distance on the text, but that is 
way too computationally intensive to scale.  So the goal is to find 
something that would be workable in a production system.  For example, a 
given NYT article, and its printer friendly version should be deemed to be 
the same.

-Mike


