Posted to java-user@lucene.apache.org by renavatior <lv...@126.com> on 2009/02/12 07:31:24 UTC

How to compute the similarity of a web page?

I am doing some research on vertical search. I have defined weights for
several keywords in my corpus that express a certain theme. How can I use
these weights to compute the similarity with a given web page (passed by
URL to the compute method)? I looked at the source code of Similarity.java
in Lucene, but I do not know how to use methods such as TF, IDF, and so on.
I would really appreciate it if anyone could give me some advice. Thanks in
advance.
-- 
View this message in context: http://www.nabble.com/How-to-compute-the-simlarity-of-a-web-page--tp21970680p21970680.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: How to compute the similarity of a web page?

Posted by Ken Krugler <kk...@transpac.com>.

FWIW, we did something similar with our vertical 
crawl for Krugle. For each web page, we'd 
generate a TreeMap of terms/frequencies. Then 
we'd calculate the angle between this term vector 
representation, and a target term vector we 
generated by analyzing many "good" pages.
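
The angle calculation described above can be sketched as a cosine
similarity between two sparse term/frequency maps. This is only a minimal
illustration, not the Krugle code: the class name TermVectorCosine, the
whitespace tokenizer, and the sample strings are all assumptions made for
the example.

```java
import java.util.Map;
import java.util.TreeMap;

public class TermVectorCosine {

    /** Counts how often each whitespace-separated, lower-cased token occurs. */
    public static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>();
        for (String token : text.toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    /** Cosine of the angle between two sparse term vectors; 0 if either is empty. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a term vector. */
    public static double norm(Map<String, Integer> v) {
        double sumSquares = 0.0;
        for (int f : v.values()) {
            sumSquares += (double) f * f;
        }
        return Math.sqrt(sumSquares);
    }

    public static void main(String[] args) {
        // Hypothetical target vector and page text, for illustration only.
        Map<String, Integer> target = termFrequencies("lucene search index search");
        Map<String, Integer> page = termFrequencies("search index tutorial");
        System.out.println(cosine(target, page)); // roughly 0.707
    }
}
```

A score near 1 means the page's term distribution points in nearly the
same direction as the target; a score of 0 means the two share no terms.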

Since we were using Nutch, we'd use this score to 
adjust the OPIC weights for the outlinks, thus 
focusing the crawl on pages referenced by what we 
considered to be "good" pages. Though the way 
Nutch used OPIC made it highly susceptible to 
spammy link farms, so you'd need to add code to 
guard against that if you take this approach.

Note that most of the work here will be in 
analyzing the results and then tuning your target 
term vector - e.g. which terms (stop words) do 
you ignore, should you use unit vectors (all 
frequencies set to 1), what set of data do you 
use to generate the target term vector, etc.
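
To make those tuning knobs concrete, here is a rough sketch of building a
target vector from a set of "good" pages, with a stop-word filter and a
switch for what the message calls unit vectors (every present term gets
weight 1). The stop-word list, the simple tokenizer, and the choice to
average per-term weights across pages are assumptions for illustration;
the message does not say how the original target vector was combined.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class TargetVectorBuilder {

    // Hypothetical stop-word list; a real one would be tuned per corpus.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "and", "to"));

    /** Term vector for one page; binary=true records presence (1) instead of counts. */
    public static Map<String, Double> pageVector(String text, boolean binary) {
        Map<String, Double> vector = new TreeMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            if (binary) {
                vector.put(token, 1.0);
            } else {
                vector.merge(token, 1.0, Double::sum);
            }
        }
        return vector;
    }

    /** Target vector as the per-term average over the given "good" pages (an assumed scheme). */
    public static Map<String, Double> targetVector(List<String> goodPages, boolean binary) {
        Map<String, Double> target = new TreeMap<>();
        for (String page : goodPages) {
            for (Map.Entry<String, Double> e : pageVector(page, binary).entrySet()) {
                target.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        if (!goodPages.isEmpty()) {
            target.replaceAll((term, sum) -> sum / goodPages.size());
        }
        return target;
    }

    public static void main(String[] args) {
        // Made-up page texts, for illustration only.
        List<String> good = Arrays.asList("The index of the index", "search the index");
        System.out.println(targetVector(good, false)); // raw counts, averaged
        System.out.println(targetVector(good, true));  // presence only, averaged
    }
}
```

Comparing crawl results with the flag on and off is one way to see whether
raw frequencies help or merely reward pages that repeat a term.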

Something we didn't do, which seemed valuable, 
would be to use phrases vs. single terms, along 
the lines of Amazon's SIPs (statistically 
improbable phrases).

-- Ken



-- 
Ken Krugler
+1 530-210-6378

Re: How to compute the similarity of a web page?

Posted by Linhon <li...@tudou.com>.

Wow, it sounds very nice, thank you :)

Re: How to compute the similarity of a web page?

Posted by Grant Ingersoll <gs...@apache.org>.

Hmmm, you might be able to do the following:

Create a document in a memory index containing the web page
Create a query from the keywords
Do a search with the query against the memory index and see the score.

Alternatively, you could use the corpus statistics to create a
term vector from the document (as if it were a member of the
collection) and then do the cosine calculation of that document with
your query (whose weights you also calculated based on your
collection's stats).
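
As a rough sketch of this second option, the code below weights a page's
raw term counts with corpus statistics and takes the cosine against a set
of keyword weights. It is plain Java rather than Lucene API calls, and the
specific formulas (square-root term frequency, 1 + ln(numDocs / (docFreq +
1)) for inverse document frequency) are one common TF-IDF variant chosen
as an assumption; the class name, document frequencies, and keyword
weights are made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdfCosine {

    /** Inverse document frequency; terms never seen in the corpus use docFreq 0. */
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /** Weights raw term counts by sqrt(tf) * idf using corpus document frequencies. */
    public static Map<String, Double> weigh(Map<String, Integer> termCounts,
                                            Map<String, Integer> docFreqs,
                                            int numDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            int df = docFreqs.getOrDefault(e.getKey(), 0);
            weights.put(e.getKey(), Math.sqrt(e.getValue()) * idf(df, numDocs));
        }
        return weights;
    }

    /** Cosine similarity between two weighted sparse vectors; 0 if either is empty. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, sqA = 0.0, sqB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            sqA += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double w : b.values()) {
            sqB += w * w;
        }
        return (sqA == 0.0 || sqB == 0.0) ? 0.0 : dot / Math.sqrt(sqA * sqB);
    }

    public static void main(String[] args) {
        // Hypothetical corpus statistics: 100 documents in total.
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("search", 90);
        docFreqs.put("lucene", 5);

        // Raw term counts extracted from the web page being scored.
        Map<String, Integer> pageCounts = new HashMap<>();
        pageCounts.put("search", 4);
        pageCounts.put("lucene", 1);

        // Hand-assigned theme keyword weights, as in the original question.
        Map<String, Double> keywords = new HashMap<>();
        keywords.put("lucene", 1.0);
        keywords.put("crawler", 0.5);

        Map<String, Double> pageWeights = weigh(pageCounts, docFreqs, 100);
        System.out.println(cosine(pageWeights, keywords));
    }
}
```

Because rare terms get a larger idf, a single occurrence of a rare theme
keyword can count for more than several occurrences of a common word.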

Last, it sounds like you are essentially describing a categorization  
task.  Have a look at some categorization software (for instance,  
Mahout can do Naive Bayes categorization or some alternatives).

Of course, I might be missing something in understanding what you are  
asking, so feel free to give a shout back to discuss.

HTH,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

