Posted to user@nutch.apache.org by Peter Swoboda <pr...@gmx.de> on 2006/03/03 10:27:53 UTC

how can i go deep?

Hi.
I've done a whole-web crawl as shown in the tutorial.
There is just "http://www.kreuztal.de/" in the urls.txt.
I did the fetching three times.
But unfortunately the crawl hasn't gone deep.
While searching, I can only find keywords from the first (home) page.
For example, I couldn't find anything on
"http://www.kreuztal.de/impressum.php".
How can I configure the depth?
Thanks for helping.

greetings
Peter

-- 
Save up to 70% of your online costs: GMX SmartSurfer!
Download for free: http://www.gmx.net/de/go/smartsurfer

Unable to complete fetch

Posted by Gal Nitzan <gn...@usa.net>.
Hi,

1. I have disabled speculative tasks by setting the relevant property to
false in hadoop-site.xml (see the sketch below).

2. Now I notice that the fetcher does not complete the whole fetchlist.

3. By adding additional logging in generate I see 20000 links being
generated, but the fetcher, without any indication of an error, fetches
only some 200-1000 urls.

4. The Hadoop logging just writes "task ... has completed".

5. I am using the latest Hadoop version.
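As an illustration of the override in item 1: the property name below is
recalled from the Hadoop defaults of that era, so treat it as an
assumption and check hadoop-default.xml for the exact key.

<!-- hadoop-site.xml: disable speculative execution of tasks (assumed key) -->
<property>
  <name>mapred.speculative.execution</name>
  <value>false</value>
</property>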

Gal.




Re: how can i go deep?

Posted by Steven Yelton <st...@missiondata.com>.
I'd be glad to, but I need to clean them up a bit (and make them more
generic) first.  In the meantime, here is a link to an article that I
found helpful:

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

Search for 'recrawl'.  You can use this script out of the box to add 
more segments to an existing index (from the initial 'crawl' command 
perhaps).

Steven

Richard Braman wrote:

>Steven, could you share those shell scripts?

Re: how can i go deep?

Posted by Steven Yelton <st...@missiondata.com>.
Attached is the shell script I am using.  It can be used either to
emulate the crawl command or to recrawl an existing index.  Feedback is
welcome and appreciated.

A few notes:
   * You need to change 'nutch_home' to point to your nutch installation
   * Running with no arguments will print the usage
   * Only supports the local fs

To initiate a crawl (where myurls is a file with a list of URLs, one per
line):
  ./crawl.sh -initdb myurls /tmp/index

To recrawl using the same web db (updated pages or just to go deeper):
  ./crawl.sh /tmp/index


Steven

Richard Braman wrote:

>Steven, could you share those shell scripts?

Re: how can i go deep?

Posted by Jeff Ritchie <jr...@netwurklabs.com>.
Here is an automated Perl script... it could be ported to a shell script.

There is no failure checking, though I probably should check for
failures... but I'm usually keeping an eye on things...
nutch-runner.pl
#!/usr/bin/perl
use strict;
use warnings;

# Endless loop: generate a fetchlist, fetch the newest segment, then
# update the crawldb and the linkdb.  Indexing is done later by
# big-index.pl (see below).
while (1) {
    my $trash = `/opt/nutch/bin/nutch generate /mycrawl/crawldb /mycrawl/segments -topN 10000`;

    sleep 5;

    # Find the segment generate just created: take the last line of the DFS
    # listing; its first tab-separated field is the full segment path.
    my @lines = split(/\n/, `/opt/nutch/bin/hadoop dfs -ls /mycrawl/segments`);
    my $seg = $lines[-1];
    my @segs = split(/\t/, $seg);
    $seg = $segs[0];
    my @segshorts = split(/\//, $seg);
    my $segshort = $segshorts[-1];   # short segment name, used by the index step below

    $trash = `/opt/nutch/bin/nutch fetch $seg`;

    sleep 5;

    $trash = `/opt/nutch/bin/nutch updatedb /mycrawl/crawldb $seg`;

    sleep 5;

    $trash = `/opt/nutch/bin/nutch invertlinks /mycrawl/linkdb $seg`;

    sleep 5;

    # Removed to do a big index later.....
    # $trash = `/opt/nutch/bin/nutch index /mycrawl/index-$segshort /mycrawl/crawldb /mycrawl/linkdb $seg`;
    # sleep 5;
}

the 'big index'

big-index.pl
#!/usr/bin/perl

@lines = split(/\n/,`/opt/nutch/bin/hadoop dfs -ls /mycrawl/segments`);
foreach $line (@lines) {
if ($line =~ /dir/) {
@segs = split(/\t/,$line);
$seg = $segs[0];
$allsegs = $allsegs . " " . $seg;
}
}
`/opt/nutch/bin/nutch index /mycrawl/indexes /mycrawl/crawldb 
/mycrawl/linkdb $allsegs`;
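One possible way to drive the two scripts, assuming they are saved as
nutch-runner.pl and big-index.pl on the crawl machine (the log file name
is just a placeholder):

  # keep the generate/fetch/update cycle running in the background
  nohup perl nutch-runner.pl > nutch-runner.log 2>&1 &
  # later, build the combined index over all segments
  perl big-index.pl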



Richard Braman wrote:
> Steven, could you share those shell scripts?


RE: how can i go deep?

Posted by Richard Braman <rb...@bramantax.com>.
Steven, could you share those shell scripts?



Re: how can i go deep?

Posted by Steven Yelton <st...@missiondata.com>.
Yes!  I have abandoned the 'crawl' command even for my single-site
searches.  I wrote shell scripts that accomplish (generally) the same
tasks the crawl does.

The only piece I had to watch out for is that one of the first things the
'crawl' class does is load 'crawl-tool.xml'.  So to get the exact same
behavior, I cut and pasted the contents of 'crawl-tool.xml' into my
'nutch-site.xml' (these configuration parameters do things like include
the crawl-urlfilter.txt, pay attention to internal links, try not to kill
your host, and so on...)
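As a rough illustration only: the property names below are recalled from
the Nutch defaults rather than copied from crawl-tool.xml, so treat them
as assumptions and take the authoritative list from your own copy of
crawl-tool.xml.

<!-- nutch-site.xml: crawl-style overrides (assumed examples) -->
<property>
  <name>urlfilter.regex.file</name>
  <value>crawl-urlfilter.txt</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>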

Steven


RE: how can i go deep?

Posted by Richard Braman <rb...@bramantax.com>.
Stefan,

I think I know what you're saying.  When you are new to Nutch and you
read the tutorial, it kind of leads you to believe (incorrectly) that
whole-web crawling is different from intranet crawling and that the
steps are somehow different and independent of one another.  In fact, it
looks like the crawl command is some kind of consolidated way of doing
each of the steps involved in whole-web crawling.

I think what I didn't understand is that you don't ever have to use the
crawl command at all, even if you are limiting your crawling to a
limited list of URLs.

Instead you can:

-create your list of urls (put them in a urls.txt file)
-create the url filter, to make sure the fetcher stays within the bounds
of the urls you want to crawl

-Inject the urls into the crawl database:
bin/nutch inject crawl/crawldb urls.txt

-generate a fetchlist which creates a new segment
bin/nutch generate crawl/crawldb crawl/segments

-fetch the segment
bin/nutch fetch <segmentname>

-update the db
bin/nutch updatedb crawl/crawldb crawl/segments/<segmentname>

-index the segment
bin/nutch index crawl/indexdb crawl/segments/<segmentname>

Then you could repeat the steps from generate to index again, which would
generate, fetch, update (the db with the fetched segment), and index a
new segment.
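A minimal sketch of that repetition, assuming the local filesystem and
the command forms listed above; the newest directory under crawl/segments
is taken to be the segment just generated, and depending on the Nutch
version the index step may also need the crawldb and linkdb (compare the
Perl script elsewhere in this thread):

#!/bin/sh
# one generate/fetch/updatedb/index pass per round; each round goes one level deeper
for round in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`   # segment that generate just created
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
  bin/nutch index crawl/indexdb $segment
done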

When you run generate, the -topN parameter generates a fetchlist based on
what?  I think the answer is the top-scoring pages already in the
crawldb, but I am not 100% positive.

Rich







Re: how can i go deep?

Posted by Stefan Groschupf <sg...@media-style.com>.
The crawl command creates a crawlDB for each call, so, as Richard
mentioned, try a higher depth.
If you would like Nutch to go deeper with each iteration, follow the
whole-web tutorial but change the url filter so that it only crawls
your website.
This will go as deep as the number of iterations you run.
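For example, a single pattern like the one below in crawl-urlfilter.txt
(or in the regex-urlfilter file, depending on which filter plugin is
active) keeps the crawl on the site from the original question; it just
follows the placeholder pattern from the tutorial:

  # accept kreuztal.de and its subdomains, skip everything else
  +^http://([a-z0-9]*\.)*kreuztal.de/
  -.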


Stefan


---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com



RE: how can i go deep?

Posted by Richard Braman <rb...@bramantax.com>.
Try using the -depth parameter when you do the crawl.  Post-crawl I don't
know, but I have the same question: how to make the index go deeper on
your next round of fetching is still something I haven't figured out.
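For instance (the directory names and numbers are placeholders; the flags
are the ones the tutorial's crawl command takes):

  # crawl the seed list in ./urls, following links up to 10 hops deep
  bin/nutch crawl urls -dir crawl -depth 10 -topN 1000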
