Posted to user@nutch.apache.org by Victor Lee <vi...@yahoo.com> on 2005/12/20 04:48:18 UTC

Does Search Result Show Similar Pages Like Google?

 Hi,
    Does Nutch's search result show "similar pages" like Google?  I went to Modzex.com which is using Nutch but I don't see "similar pages" in its search result.
 
 Many thanks.
 

__________________________________________________
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Does Search Result Show Similar Pages Like Google?

Posted by Doug Cutting <cu...@nutch.org>.
Victor Lee wrote:
>     Does Nutch's search result show "similar pages" like Google?  I went to Modzex.com which is using Nutch but I don't see "similar pages" in its search result.

One could use the Lucene "more-like-this" library to implement this:

http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/similarity/

Doug

Re: Does Search Result Show Similar Pages Like Google?

Posted by Jérôme Charron <je...@gmail.com>.
Take a look at the clustering-carrot2 plugin.

Regards

Jérôme

On 12/21/05, Daqing Zhao <de...@gmail.com> wrote:
>
> I think clustering the documents would be a solution: just recommend
> other documents in the same cluster. Is there a clustering algorithm in
> Nutch? It may be very expensive to calculate.
>
> Daqing Zhao


--
http://motrech.free.fr/
http://www.frutch.org/

Re: Does Search Result Show Similar Pages Like Google?

Posted by Stefan Groschupf <sg...@media-style.com>.
Real clustering is impossible for a web search engine unless you
have unlimited hardware resources.
However, as Jérôme suggests, there is a search result clustering plugin.
If you are more familiar with math and algorithms, you will find this
article interesting:

http://www.stanford.edu/~taherh/papers/scalable-clustering.pdf

But as mentioned, calculating the most important words of a document
and making a new query from them is the best solution for now.

Stefan


On 21.12.2005 at 00:48, Daqing Zhao wrote:

> I think clustering the documents would be a solution: just recommend
> other documents in the same cluster. Is there a clustering algorithm in
> Nutch? It may be very expensive to calculate.
>
> Daqing Zhao

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: Does Search Result Show Similar Pages Like Google?

Posted by Daqing Zhao <de...@gmail.com>.
I think clustering the documents would be a solution: just recommend
other documents in the same cluster. Is there a clustering algorithm in
Nutch? It may be very expensive to calculate.

Daqing Zhao


On 12/20/05, Victor Lee <vi...@yahoo.com> wrote:
>
> Getting the term vector should be easy, but when you said calculation, is
> it a simple comparison of all term vectors, or is it a whole other beast?
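
If cluster assignments were already available, the recommendation step
itself would be simple. The sketch below assumes a precomputed mapping
from document to cluster id (computing that mapping is the expensive part
and is not shown). The names build_cluster_index and recommend and the
sample ids are hypothetical and not part of Nutch.

```python
from collections import defaultdict


def build_cluster_index(assignments):
    """Invert a doc -> cluster mapping into cluster -> sorted list of docs."""
    clusters = defaultdict(list)
    for doc_id, cluster_id in assignments.items():
        clusters[cluster_id].append(doc_id)
    for members in clusters.values():
        members.sort()  # deterministic order for display
    return dict(clusters)


def recommend(doc_id, assignments, clusters, limit=10):
    """Return up to `limit` other documents from the same cluster."""
    cluster_id = assignments.get(doc_id)
    if cluster_id is None:
        return []  # unknown document: nothing to recommend
    return [d for d in clusters[cluster_id] if d != doc_id][:limit]


# Hypothetical assignments, assumed to come from some offline clustering job.
assignments = {"u1": 0, "u2": 0, "u3": 1, "u4": 0}
clusters = build_cluster_index(assignments)
print(recommend("u1", assignments, clusters))  # ['u2', 'u4']
```

A document that is alone in its cluster gets an empty list, so the
front end would need a fallback for that case.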

Re: Does Search Result Show Similar Pages Like Google?

Posted by Victor Lee <vi...@yahoo.com>.
Getting the term vector should be easy, but when you said calculation, is it a simple comparison of all term vectors, or is it a whole other beast?

Stefan Groschupf <sg...@media-style.com> wrote:
No, Nutch has no such functionality.
The quick-and-dirty way to implement this would be to extract the
term vector from the original document, somehow calculate the most
important terms for this document (there are different algorithms for
that), and just run a query with these terms.
HTH
Stefan
P.S. Contributions are always welcome. :)
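
For comparison, this is roughly what "a simple comparison of all term
vectors" would look like. It is only a sketch under assumptions: cosine
similarity is one common choice of comparison that the thread does not
name, and cosine, most_similar and the toy vectors are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term vectors (term -> weight)."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(target_id, vectors, k=3):
    """Compare one document against every other document in `vectors`."""
    target = vectors[target_id]
    scored = [
        (cosine(target, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != target_id
    ]
    # Highest similarity first; ties broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Toy term vectors built from raw term counts.
vectors = {
    "a": Counter("nutch search engine".split()),
    "b": Counter("nutch search plugin".split()),
    "c": Counter("yahoo mail spam".split()),
}
print(most_similar("a", vectors, k=2))  # ['b', 'c']
```

Each call scans the vector of every other document, so the cost of one
"similar pages" request grows with the size of the whole collection.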

Re: Does Search Result Show Similar Pages Like Google?

Posted by Stefan Groschupf <sg...@media-style.com>.
No, Nutch has no such functionality.
The quick-and-dirty way to implement this would be to extract the
term vector from the original document, somehow calculate the most
important terms for this document (there are different algorithms for
that), and just run a query with these terms.
HTH
Stefan
P.S. Contributions are always welcome. :)
On 20.12.2005 at 04:48, Victor Lee wrote:

>  Hi,
>     Does Nutch's search result show "similar pages" like Google?  I  
> went to Modzex.com which is using Nutch but I don't see "similar  
> pages" in its search result.
>
>  Many thanks.
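
The quick-and-dirty approach described in this message can be sketched in
a few lines of Python. This is a minimal illustration, not Nutch or Lucene
code: TF-IDF is only one of the "different algorithms" one could use to
pick the important terms, and the names top_terms, similar_pages_query,
doc_freq and num_docs are hypothetical. It assumes the document's tokens
and the per-term document frequencies are already available from the
index.

```python
import math
from collections import Counter


def top_terms(doc_tokens, doc_freq, num_docs, k=5):
    """Rank the terms of one document by TF-IDF and return the k best."""
    tf = Counter(doc_tokens)  # the document's term vector: term -> count
    scores = {}
    for term, freq in tf.items():
        df = doc_freq.get(term, 0)  # terms unknown to the index count as rare
        # Smoothed IDF: terms found in nearly every document score close to 0.
        idf = math.log((num_docs + 1) / (df + 1))
        scores[term] = freq * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def similar_pages_query(terms):
    """Join the selected terms into a simple OR query string."""
    return " OR ".join(terms)


# Toy numbers: 1000 indexed documents, "the" occurs in all of them.
tokens = "nutch crawler nutch index the the the page".split()
doc_freq = {"the": 1000, "page": 900, "index": 300, "crawler": 20, "nutch": 10}
query = similar_pages_query(top_terms(tokens, doc_freq, num_docs=1000, k=3))
print(query)  # nutch OR crawler OR index
```

With these toy numbers the very common word "the" scores zero even
though it is the most frequent term in the document, so the query is
built from the rarer terms. The resulting string would then be run as an
ordinary search.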