Posted to java-user@lucene.apache.org by sdeck <sc...@gmail.com> on 2007/07/07 01:35:15 UTC

Related Article question

Hello all,
  I have been trying out MoreLikeThis and several other similarity-style
queries, but I still run into problems with related content not being matched.

Let me give an example, as well as some questions that hopefully someone can
answer, to help me refine my work.

Example:
1) Document A may have the title "Oden and Durant Are Being Recruited", and
Document B the title "Trailblazers Look at Oden and Durant".
Both documents talk about the recruitment of Oden and Durant, just
in fairly different ways. One may emphasize Oden over Durant, or vice versa.
The way MoreLikeThis and similarity queries seem to work is that they
take terms and see whether a lot of them match up across the documents. So, if
Durant appears 10 times in doc A and 10 times in doc B, the similarity will be
higher.
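
A minimal plain-Java sketch of that intuition, scoring two texts by the cosine similarity of their raw term counts. This is not how MoreLikeThis works internally; the class and method names (TermOverlap, termFreqs, cosine) are made up for illustration, and the tokenizer is a deliberately crude assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: scores two texts by the cosine
// similarity of their raw term-frequency vectors.
class TermOverlap {

    // Lowercase the text, split on anything that is not a-z, count each term.
    static Map<String, Integer> termFreqs(String text) {
        Map<String, Integer> freqs = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    // Cosine similarity: dot product divided by the product of vector lengths.
    // Dividing by the lengths keeps a long document from winning on size alone.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        double normA = 0.0;
        double normB = 0.0;
        for (int v : a.values()) normA += v * v;
        for (int v : b.values()) normB += v * v;
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> a = termFreqs("Oden and Durant are being recruited");
        Map<String, Integer> b = termFreqs("Trailblazers look at Oden and Durant");
        System.out.println(cosine(a, b));
    }
}
```

The two example titles share three of their six terms ("oden", "and", "durant"), so the score comes out at roughly 0.5, and one of the three shared terms is just "and". With texts this short, a few differently worded sentences are enough to pull the score down even when the topic is the same.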

Here is my problem, though. I run these MoreLikeThis and other similarity
queries, and many of those types of articles do not get matched, because a
lot of the terms are not the same even though they are talking about the same
topic.

Here is what I wonder:
1) Should I somehow give more boost to a full name, or other names, or
titles to help matching? Or does that hinder things?
2) How does shorter content versus longer content work? I may only get
around 5-6 sentences in one document but a full page in another, yet they
are still talking about the same thing.
3) How would term vectors help, versus not storing term vectors?

For context, here is how the system is set up. I have one main index. I will
run a search of the web and collect more documents. Before adding these to
the main index, I run a MoreLikeThis query against the main index for
each of the new documents to be inserted. That way, I can keep a separate
record of which articles are related to each other for faster lookups. I also
run a MoreLikeThis query against the new index, just to see which recently
collected articles are similar to each other.
It would seem that document frequency and term counts will not really work
in these sorts of scenarios.
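
The ingest flow described above could be sketched roughly as follows. Everything here is hypothetical: the in-memory map stands in for the main index, the pluggable similarity function stands in for a MoreLikeThis query, and the jaccard method and the threshold are placeholder assumptions, not anything taken from Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

// Hypothetical sketch of the ingest workflow; none of these names come from Lucene.
class RelatedArticleIngest {
    // Stand-in for the main index: document id -> text.
    private final Map<String, String> mainIndex = new LinkedHashMap<>();
    // Separate lookup table: document id -> ids of related documents.
    private final Map<String, List<String>> related = new LinkedHashMap<>();
    private final BiFunction<String, String, Double> similarity;
    private final double threshold;

    RelatedArticleIngest(BiFunction<String, String, Double> similarity, double threshold) {
        this.similarity = similarity;
        this.threshold = threshold;
    }

    // Compare the new document with everything already indexed, record the
    // links in both directions, then add the document to the main index.
    void ingest(String id, String text) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> existing : mainIndex.entrySet()) {
            if (similarity.apply(text, existing.getValue()) >= threshold) {
                matches.add(existing.getKey());
                related.get(existing.getKey()).add(id);
            }
        }
        related.put(id, matches);
        mainIndex.put(id, text);
    }

    // Fast lookup of related ids without querying the index again.
    List<String> relatedTo(String id) {
        return related.getOrDefault(id, new ArrayList<>());
    }

    // Placeholder similarity (assumption): shared words / all distinct words.
    static double jaccard(String a, String b) {
        Set<String> wordsA = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> wordsB = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> shared = new HashSet<>(wordsA);
        shared.retainAll(wordsB);
        Set<String> all = new HashSet<>(wordsA);
        all.addAll(wordsB);
        return all.isEmpty() ? 0.0 : (double) shared.size() / all.size();
    }
}
```

Keeping the similarity function pluggable means the same bookkeeping can be reused while experimenting with different scoring approaches; only the function passed to the constructor and the threshold change.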

I am not sure I am explaining my problem as well as I could, but I would love
some kind of reference on how to do related-article searching
and on how to refine my results. Right now, I would say about 60-70% get
correctly mapped into related articles, and about 10-20% get
incorrectly mapped as a related article (similar terms, perhaps not
much content, but the article is not actually about any of the others).

Any help would be appreciated.
Thanks
Scott
-- 
View this message in context: http://www.nabble.com/Related-Article-question-tf4038641.html#a11474031
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Related Article question

Posted by Ryan Ackley <ry...@gmail.com>.
I was playing around with MoreLikeThis and I noticed the problems you
are talking about as well.

One idea I thought of was for MoreLikeThis to focus only on proper
nouns for the purposes of similarity or give a significant boost to
those. Pretty much the same idea you had in #1.

I found a list of the 1000 most used English words somewhere on the
net (including stemmed variations; see the link below). Excluding those
words would be one rough way to single out proper nouns: if a term for the
MoreLikeThis query is in that list, don't use it.

This is a good resource:

http://www1.harenet.ne.jp/~waring/vocab/wordlists/vocfreq.html
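
A minimal sketch of the common-word filter idea in plain Java, assuming the word list has already been loaded into a set. The class and method names (CommonWordFilter, keepRareTerms) are hypothetical and not part of Lucene, and the tiny word list in the usage note is a stand-in for the real 1000-word list.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: drops candidate query terms that
// appear in a list of very common words, leaving the rarer (often proper-noun) terms.
class CommonWordFilter {
    private final Set<String> commonWords = new HashSet<>();

    CommonWordFilter(Set<String> words) {
        // Store lowercase so the lookup below is case-insensitive.
        for (String w : words) {
            commonWords.add(w.toLowerCase());
        }
    }

    // Keep only the candidate terms that are NOT in the common-word list,
    // preserving their original order and casing.
    List<String> keepRareTerms(List<String> candidateTerms) {
        List<String> kept = new ArrayList<>();
        for (String term : candidateTerms) {
            if (!commonWords.contains(term.toLowerCase())) {
                kept.add(term);
            }
        }
        return kept;
    }
}
```

With a common-word set of "and", "are", "being", "look" and "at", the candidate terms Oden, and, Durant, are, being, recruited reduce to Oden, Durant, recruited. Note that "recruited" survives even though it is not a proper noun, so this is only a heuristic. Recent Lucene versions of MoreLikeThis also have a setStopWords method that could take the same list; check whether your version has it.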

