Posted to user@nutch.apache.org by victor_emailbox <vi...@yahoo.com> on 2006/09/01 09:21:10 UTC

RE: How to Make Nutch Return Search Results Belonged to the Crawl URL Li

Thanks.  But if I make Nutch ignore external links with
db.ignore.external.links, will that affect the quality of the search
results?  From what I have read, Nutch seems to do something similar to
Google's PageRank.  If so, search quality will suffer if it doesn't analyze
the external links.


Vishal Shah-3 wrote:
> 
> Hello Victor,
> 
>   If I understand correctly, you want to use a seed list that contains
> some sites, and then search only pages belonging to these sites. In
> this case, it's best not to crawl pages from other sites. This can be
> done by setting db.ignore.external.links to true in your nutch-site.xml,
> which limits your crawl to pages from the initially injected hosts.
> 
> Regards,
> 
> -vishal.
> 
> -----Original Message-----
> From: victor_emailbox [mailto:victor_emailbox@yahoo.com] 
> Sent: Thursday, August 31, 2006 10:51 AM
> To: nutch-user@lucene.apache.org
> Subject: Re: How to Make Nutch Return Search Results Belonged to the
> Crawl URL Li
> 
> 
> No, I meant: suppose the crawl URL list has http://www.abc.com and
> http://www.bcc.com, and both sites contain the term "hello".  bcc.com
> also has a link to ccc.com, which also contains the term "hello" but is
> not part of the crawl URL list.
> 
> So when I search for "hello", will Nutch return abc.com, bcc.com and
> ccc.com by default?  If so, how do I force Nutch to return abc.com and
> bcc.com without ccc.com?
> 
> Thanks.
> 
> 
> Zaheed Haque wrote:
>> 
>> Hi
>> 
>> You mean show results from the site http://abc.com only? If so, you need
>> to turn on the index-more and query-more plugins in nutch-site.xml and
>> then use a query like  site:http://abc.com +query term, or url: ...
>> I think it's site, but I'm not sure.
>> 
>> Cheers
>> 
>> On 8/31/06, victor_emailbox <vi...@yahoo.com> wrote:
>>>
>>> Hi,
>>>   I enter 10 URLs in the crawl URL list.  Nutch does its thing to fetch
>>> and index them.  How do I force Nutch to return only search results that
>>> belong to the URL list?  E.g., if the crawl URL list has only
>>> http://www.abc.com and http://www.bcc.com, then all search results
>>> should be under either abc.com or bcc.com, not ccc.com, even if bcc.com
>>> contains links referring to ccc.com.
>>>
>>> Many thanks.
>>> --
>>> View this message in context:
>>> http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6072986
>>> Sent from the Nutch - User forum at Nabble.com.
>>>
>>>
>> 
>> 
> 
> -- 
> View this message in context:
> http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6073242
> Sent from the Nutch - User forum at Nabble.com.
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6093923
Sent from the Nutch - User forum at Nabble.com.


RE: How to Make Nutch Return Search Results Belonged to the Crawl URL Li

Posted by victor_emailbox <vi...@yahoo.com>.
Thanks.
I put the following in nutch-site.xml:
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  </description>
</property>

But somehow, when I do a search in Nutch, it still returns results from other
sites.  Why?
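
For illustration only, the goal stated in this thread (results limited to
the hosts in the crawl URL list) can also be approximated by filtering
result URLs after the search. The sketch below is plain Python, not a Nutch
API; the function names and the sample URLs are hypothetical, and it assumes
the result URLs are available as a list of strings.

```python
from urllib.parse import urlparse


def seed_hosts(seed_urls):
    """Collect the host names of the seed URLs (hypothetical helper)."""
    return {urlparse(url).hostname for url in seed_urls}


def keep_seed_results(result_urls, seed_urls):
    """Keep only result URLs whose host appears in the seed list."""
    allowed = seed_hosts(seed_urls)
    return [url for url in result_urls if urlparse(url).hostname in allowed]


# Sample data modeled on the example hosts used in this thread.
seeds = ["http://www.abc.com", "http://www.bcc.com"]
results = [
    "http://www.abc.com/hello.html",
    "http://www.ccc.com/hello.html",
    "http://www.bcc.com/page.html",
]

filtered = keep_seed_results(results, seeds)
print(filtered)
```

This compares full host names, so www.abc.com and abc.com would count as
different hosts; whether that is the right granularity depends on the seed
list.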


Vishal Shah-3 wrote:
> 
> Hi Victor,
> 
>   In this case, link analysis will be done only on the link graph
> between the URLs you fetch from the hosts in your seed list. As you
> said, this might not give you a true idea of the link popularity of
> your URLs. On the other hand, if you set db.ignore.external.links to
> false, you will be crawling URLs outside your seed hosts, and it would
> be difficult for you to control the crawl. Since you are only
> interested in your seed list hosts, I would still recommend setting
> db.ignore.external.links to true to limit your crawl to your target
> hosts and, if required, changing the scoring algorithm.
> 
>   How big is your seed list? Do you know how well these hosts link to
> each other? If the seed hosts belong to the same domain, for example,
> there might be high interlinking between URLs from these hosts, and
> you should see decent results with the default ranker.
> 
> -vishal.
> 

-- 
View this message in context: http://www.nabble.com/How-to-Make-Nutch-Return-Search-Results-Belonged-to-the-Crawl-URL-List--tf2194391.html#a6212829
Sent from the Nutch - User forum at Nabble.com.


RE: How to Make Nutch Return Search Results Belonged to the Crawl URL Li

Posted by Vishal Shah <vi...@rediff.co.in>.
Hi Victor,

  In this case, link analysis will be done only on the link graph
between the URLs you fetch from the hosts in your seed list. As you
said, this might not give you a true idea of the link popularity of
your URLs. On the other hand, if you set db.ignore.external.links to
false, you will be crawling URLs outside your seed hosts, and it would
be difficult for you to control the crawl. Since you are only
interested in your seed list hosts, I would still recommend setting
db.ignore.external.links to true to limit your crawl to your target
hosts and, if required, changing the scoring algorithm.
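
As a rough illustration of the behaviour described for
db.ignore.external.links (outlinks leading from a page to external hosts
are ignored), here is a minimal Python sketch. It is not Nutch's actual
code; the function name and sample URLs are made up, and it assumes
"external" means that the link's host differs from the host of the page it
was found on.

```python
from urllib.parse import urlparse


def filter_outlinks(from_url, outlinks, ignore_external_links):
    """Return the outlinks a crawler would keep for one fetched page.

    When ignore_external_links is True, only links whose host matches
    the host of the page they were found on are kept.
    """
    if not ignore_external_links:
        return list(outlinks)
    from_host = urlparse(from_url).hostname
    return [link for link in outlinks if urlparse(link).hostname == from_host]


# Hypothetical page on a seed host with one internal and one external link.
page = "http://www.bcc.com/index.html"
links = ["http://www.bcc.com/about.html", "http://www.ccc.com/hello.html"]

kept_true = filter_outlinks(page, links, ignore_external_links=True)
kept_false = filter_outlinks(page, links, ignore_external_links=False)
print(kept_true)
print(kept_false)
```

With the flag on, the link to ccc.com is dropped and never enters the crawl;
with it off, both links are kept and the crawl can leave the seed hosts.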

  How big is your seed list? Do you know how well these hosts link to
each other? If the seed hosts belong to the same domain, for example,
there might be high interlinking between URLs from these hosts, and
you should see decent results with the default ranker.
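
One way to get a feel for how well the seed hosts link to each other is to
count the links that go from one seed host to another. The sketch below
assumes the link graph is available as (source URL, target URL) pairs; how
to export those from a crawl is not covered here, and the function name and
sample data are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def cross_host_links(edges, seed_urls):
    """Count links that go from one seed host to a different seed host."""
    hosts = {urlparse(url).hostname for url in seed_urls}
    counts = Counter()
    for src, dst in edges:
        src_host = urlparse(src).hostname
        dst_host = urlparse(dst).hostname
        if src_host in hosts and dst_host in hosts and src_host != dst_host:
            counts[(src_host, dst_host)] += 1
    return counts


# Made-up link pairs using the example hosts from this thread.
seeds = ["http://www.abc.com", "http://www.bcc.com"]
edges = [
    ("http://www.abc.com/a.html", "http://www.bcc.com/b.html"),
    ("http://www.bcc.com/b.html", "http://www.abc.com/a.html"),
    ("http://www.bcc.com/b.html", "http://www.ccc.com/c.html"),
    ("http://www.abc.com/a.html", "http://www.abc.com/d.html"),
]

counts = cross_host_links(edges, seeds)
for (src_host, dst_host), n in sorted(counts.items()):
    print(src_host, "->", dst_host, n)
```

Links to hosts outside the seed list and links within a single host are
skipped, so a mostly empty result suggests the seed hosts are weakly
interlinked.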

-vishal.
