Posted to user@nutch.apache.org by Otis Gospodnetic <og...@yahoo.com> on 2008/06/22 22:13:11 UTC

Fetching only unfetched URLs

Hi,

Is there an existing method for generating a segment/fetchlist containing only URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large, "old" CrawlDb that "knows" about a lot of URLs (the ones with "db_unfetched" status if you run -stats). In such a situation a person may prefer to fetch only the yet-unfetched URLs first, and only after that include URLs that need to be refetched in the newly generated segments.

One can write a custom Generator, or perhaps modify the existing one to add this option, but is there an existing mechanism for this?

If not, does this sound like something that should be added to the existing Generator and invoked via a command-line arg, say -unfetchedOnly ?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
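
As a rough sketch of the idea (paths are illustrative, and -unfetchedOnly is only the option proposed above, not an existing flag):

# see how many URLs an existing CrawlDb holds per status (db_fetched vs. db_unfetched)
bin/nutch readdb crawl/crawldb -stats

# hypothetical usage, if the proposed option existed:
# generate a fetchlist containing only never-fetched URLs
bin/nutch generate crawl/crawldb crawl/segments -unfetchedOnly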


Error starting Nutch-0.9 in Tomcat 5

Posted by Winton Davies <wd...@cs.stanford.edu>.
Any ideas - I'm on Fedora, tomcat5, Java 1.7

I copied the nutch-0.9.war file directly into the webapps directory as ROOT.war

Had no problems installing on Mac OS X. No idea what's wrong; some research
points to a missing compiler, but I don't think that's the issue.

Winton

An error occurred at line: 28 in the jsp file: /index.jsp
The method write(char) is undefined for the type JspWriter
25:   String requestURI = HttpUtils.getRequestURL(request).toString();
26:   String base = requestURI.substring(0, requestURI.lastIndexOf('/'));
27:   response.sendRedirect(language + "/");
28: %>



type Exception report
message
description The server encountered an internal error () that 
prevented it from fulfilling this request.
exception
org.apache.jasper.JasperException: Unable to compile class for JSP:

........ <lots of repetition>


An error occurred at line: 52 in the generated java file
Throwable cannot be resolved to a type


An error occurred at line: 53 in the generated java file
t cannot be resolved


An error occurred at line: 57 in the generated java file
t cannot be resolved

Stacktrace:
	org.apache.jasper.compiler.DefaultErrorHandler.javacError(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.compiler.ErrorDispatcher.javacError(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.compiler.JDTCompiler.generateClass(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.compiler.Compiler.compile(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.compiler.Compiler.compile(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.compiler.Compiler.compile(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.JspCompilationContext.compile(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.servlet.JspServletWrapper.service(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.servlet.JspServlet.serviceJspFile(jasper5-compiler-5.5.26.jar.so)
	org.apache.jasper.servlet.JspServlet.service(jasper5-compiler-5.5.26.jar.so)
	javax.servlet.http.HttpServlet.service(tomcat5-servlet-2.4-api-5.5.26.jar.so)
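
A couple of quick things to check (paths are Fedora-ish guesses, so adjust as needed); "Throwable cannot be resolved" from Jasper usually points at the JSP compiler running against the wrong Java, e.g. a JRE without javac or a JDK newer than Tomcat 5.5 expects:

# which Java is Tomcat actually running with?
ps -ef | grep tomcat | grep java
echo $JAVA_HOME

# is that a full JDK (javac present), not just a JRE?
ls "$JAVA_HOME/bin/javac"

# Nutch 0.9 + Tomcat 5.5 predate Java 7; pointing JAVA_HOME at a Java 5 or 6 JDK
# (on Fedora, typically via /etc/tomcat5/tomcat5.conf) and restarting is worth a try.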

Re: Fetching only unfetched URLs

Posted by John Martyniak <jo...@beforedawn.com>.
Ian,

I am pretty new to Nutch myself, and I think the unfetched-URLs  
feature is what Dennis is going to look into.

There are two main ways of fetching URLs. One is to use bin/nutch  
crawl, which handles all of the individual steps of getting URLs.

The second way is to run the bin/nutch generate, bin/nutch fetch, bin/ 
nutch updatedb, and bin/nutch index commands to do all of the steps by  
hand (or by program).  I think they all run the same stuff, but  
running the commands individually is better suited to Whole Web  
Crawling, whereas bin/nutch crawl is better suited for intranet or  
enterprise crawling.

Hope this helps.

-John
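
For reference, a minimal sketch of that second, step-by-step route (directory names are illustrative; check the Nutch tutorial for the exact arguments in your version):

# one generate-fetch-update round, run by hand
bin/nutch inject crawl/crawldb urls                      # seed the CrawlDb (first run only)
bin/nutch generate crawl/crawldb crawl/segments          # build a fetchlist of URLs due for fetching
segment=crawl/segments/$(ls crawl/segments | tail -1)    # the segment just generated
bin/nutch fetch $segment                                 # fetch (and, by default, parse) the pages
bin/nutch updatedb crawl/crawldb $segment                # fold the results back into the CrawlDb
bin/nutch invertlinks crawl/linkdb -dir crawl/segments   # rebuild the LinkDb
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*   # index the segments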


On Dec 3, 2008, at 9:15 AM, Ian.huang wrote:

> hi, John
>
> I am a newbie of nutch.
>
> Can you tell me, How to deal with un-fetched url? If I run a recrawl  
> script, will un-fetched urls be handled? How about other fetched  
> url? Will them updated or refetch as well?
>
> Is generate-fetch-update methodology means to run a new crawler and  
> merge with older one?
>
> Thanks
> ian
>
> --------------------------------------------------
> From: "John Martyniak" <jo...@beforedawn.com>
> Sent: Thursday, December 04, 2008 2:01 PM
> To: <nu...@lucene.apache.org>
> Subject: Re: Fetching only unfetched URLs
>
>> I think that this would be another good piece of functionality.  As  
>> I would like to continue to use the generate-fetch-update  
>> methodology  but would like to mimic the functionality of Crawl, in  
>> that I can grab  every page at a specific domain.
>>
>> -John
>>
>> On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
>>
>>>
>>>
>>> Otis Gospodnetic wrote:
>>>> Hi,
>>>> If there an existing method for generating a segment/fetchlist  
>>>> containing only URLs that have not yet been fetched?
>>>> I'm asking because I can imagine a situation where one has a  
>>>> large  and "old" CrawlDb that "knows" about a lot of URLs (the  
>>>> ones with "db_unfetched" status if you run -stats) and in such a  
>>>> situation a person may prefer to fetch only the yet-unfetched  
>>>> URLs first, and  only after that include URLs that need to be  
>>>> refetched in the newly generated segments.
>>>
>>> I don't think a current method exists to do only unfetched URLs,  
>>> but  it does sound like an interesting bit of functionality.
>>>
>>>> One can write a custom Generator, or perhaps modify the existing   
>>>> one to add this option, but is there an existing mechanism for  
>>>> this?
>>>
>>> Generator would probably be best, let me look into what it would   
>>> take to do this.  Maybe we can get it into 1.0.
>>>
>>> Dennis
>>>
>>>> If not, does this sound like something that should be added to  
>>>> the existing Generator and invoked via a command-line arg, say -  
>>>> unfetchedOnly ?
>>>> Thanks,
>>>> Otis
>>>> --
>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>


Re: Fetching only unfetched URLs

Posted by "Ian.huang" <yi...@hotmail.com>.
Thank you, Dennis and John

I successfully crawled a website and got the following log:

2008-12-02 21:31:13,853 INFO  crawl.CrawlDbReader - TOTAL urls:	27274
....
2008-12-02 21:31:13,856 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):	15568
2008-12-02 21:31:13,856 INFO  crawl.CrawlDbReader - status 2 (db_fetched):	4393

Here I have two different purposes: I need to recrawl those URLs which 
were not finished (unfetched URLs) before, and I also want to inject some 
new URLs as new (unfetched) entries.

So I used the recrawl.sh from the wiki to do this recrawling job. I noticed the 
final step is merging the newly generated index with the old index. The logs 
regarding index merging are as follows:

2008-12-03 09:41:21,118 INFO  indexer.IndexingFilters - Adding 
org.apache.nutch.indexer.basic.BasicIndexingFilter
2008-12-03 09:41:21,371 INFO  indexer.Indexer - Optimizing index.
2008-12-03 09:41:22,414 INFO  indexer.Indexer - Indexer: done
2008-12-03 09:41:25,543 INFO  indexer.DeleteDuplicates - Dedup: starting
2008-12-03 09:41:25,620 INFO  indexer.DeleteDuplicates - Dedup: adding 
indexes in: c3/newindexes
2008-12-03 09:41:31,462 INFO  indexer.DeleteDuplicates - Dedup: done
2008-12-03 09:41:34,599 INFO  indexer.IndexMerger - merging indexes to: 
c3/index
2008-12-03 09:41:34,618 INFO  indexer.IndexMerger - Adding 
c3/newindexes/part-00000
2008-12-03 09:41:34,833 INFO  indexer.IndexMerger - done merging

I saw a merge-output folder added into the index, but for the index data, nothing 
happened. In addition, I am sure that many unfetched URLs were fetched and 
indexed.

Can you tell me what happened? Am I missing something?

In addition, can I set db.max.outlinks.per.page to -1 and end up with no 
unfetched URLs during crawling? I do not want any pages to be missed :)

Thank you very much

Ian
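
For the db.max.outlinks.per.page part, a sketch of the override I have in mind (the default is 100; a negative value is treated as "no limit" in the versions I have looked at, but verify against your nutch-default.xml):

# In conf/nutch-site.xml, inside <configuration>, add:
#
#   <property>
#     <name>db.max.outlinks.per.page</name>
#     <value>-1</value>
#   </property>
#
# Note this only raises the number of outlinks added to the CrawlDb per page;
# it cannot by itself leave zero db_unfetched entries, because each round
# discovers new URLs that are only fetched in the *next* round.

# after the next round, re-check the status counts
bin/nutch readdb crawl/crawldb -stats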


--------------------------------------------------
From: "Dennis Kubes" <ku...@apache.org>
Sent: Thursday, December 04, 2008 6:58 PM
To: <nu...@lucene.apache.org>
Subject: Re: Fetching only unfetched URLs

>
> It depends what you mean by unfetched url.  There are three basic types of 
> unfetched urls.
>
> 1) The new urls that we parse off a webpage during fetching/parsing and 
> that are added to the CrawlDb
> 2) Redirected urls that are not immediately fetched.  If the 
> http.redirect.max config variable in nutch-*.xml is set to 0 then any 
> redirect is queued to be fetched during the next fetching round similar to 
> new urls we parse off of a webpage.
> 3) Urls that have crossed their fetching expiration date in crawldb and 
> will be queued for refetching.
>
> In Nutch there really isn't the concept of re-crawling where you would 
> update *only* certain urls.  There are the concepts of fetching. merging, 
> and queuing urls for fetching.  When we talk about the generate fetch 
> update cycle we are talking about running multiple fetch (i.e. crawl) 
> cycles.  Each of these produces a segments.  URLs can be parsed from those 
> segments and inserted/updated into the CrawlDb  The CrawlDb is used to 
> generate new lists of urls to fetch.  And the process starts all over 
> again.
>
> Segments can be merged (and then indexed) together.  The CrawlDb is global 
> to all segments (although multiple crawldbs can be merged).  URLs in the 
> CrawlDb that have been successfully fetched, have a last fetched time and 
> different FetchSchedule implementations determine when the correct time is 
> to re-fetch those urls.  URLs that have not been fetched should be 
> available for fetching immediately.  URLs that have been attempted to be 
> fetched and errored are only an increasing scale for when the next fetch 
> attempt should be.
>
> Dennis
>
>
> Ian.huang wrote:
>> hi, John
>>
>> I am a newbie of nutch.
>>
>> Can you tell me, How to deal with un-fetched url? If I run a recrawl 
>> script, will un-fetched urls be handled? How about other fetched url? 
>> Will them updated or refetch as well?
>>
>> Is generate-fetch-update methodology means to run a new crawler and merge 
>> with older one?
>>
>> Thanks
>> ian
>>
>> --------------------------------------------------
>> From: "John Martyniak" <jo...@beforedawn.com>
>> Sent: Thursday, December 04, 2008 2:01 PM
>> To: <nu...@lucene.apache.org>
>> Subject: Re: Fetching only unfetched URLs
>>
>>> I think that this would be another good piece of functionality.  As I 
>>> would like to continue to use the generate-fetch-update methodology  but 
>>> would like to mimic the functionality of Crawl, in that I can grab 
>>> every page at a specific domain.
>>>
>>> -John
>>>
>>> On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
>>>
>>>>
>>>>
>>>> Otis Gospodnetic wrote:
>>>>> Hi,
>>>>> If there an existing method for generating a segment/fetchlist 
>>>>> containing only URLs that have not yet been fetched?
>>>>> I'm asking because I can imagine a situation where one has a large 
>>>>> and "old" CrawlDb that "knows" about a lot of URLs (the ones with 
>>>>> "db_unfetched" status if you run -stats) and in such a situation a 
>>>>> person may prefer to fetch only the yet-unfetched URLs first, and 
>>>>> only after that include URLs that need to be refetched in the newly 
>>>>> generated segments.
>>>>
>>>> I don't think a current method exists to do only unfetched URLs, but 
>>>> it does sound like an interesting bit of functionality.
>>>>
>>>>> One can write a custom Generator, or perhaps modify the existing  one 
>>>>> to add this option, but is there an existing mechanism for this?
>>>>
>>>> Generator would probably be best, let me look into what it would  take 
>>>> to do this.  Maybe we can get it into 1.0.
>>>>
>>>> Dennis
>>>>
>>>>> If not, does this sound like something that should be added to the 
>>>>> existing Generator and invoked via a command-line arg, say - 
>>>>> unfetchedOnly ?
>>>>> Thanks,
>>>>> Otis
>>>>> -- 
>>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>>
>>>
> 

Re: Fetching only unfetched URLs

Posted by Dennis Kubes <ku...@apache.org>.
It depends what you mean by unfetched url.  There are three basic types 
of unfetched urls.

1) The new urls that we parse off a webpage during fetching/parsing and 
that are added to the CrawlDb
2) Redirected urls that are not immediately fetched.  If the 
http.redirect.max config variable in nutch-*.xml is set to 0 then any 
redirect is queued to be fetched during the next fetching round similar 
to new urls we parse off of a webpage.
3) Urls that have crossed their fetching expiration date in crawldb and 
will be queued for refetching.

In Nutch there really isn't the concept of re-crawling where you would 
update *only* certain urls.  There are the concepts of fetching, 
merging, and queuing urls for fetching.  When we talk about the generate 
fetch update cycle we are talking about running multiple fetch (i.e. 
crawl) cycles.  Each of these produces a segment.  URLs can be parsed 
from those segments and inserted/updated into the CrawlDb.  The CrawlDb 
is used to generate new lists of urls to fetch.  And the process starts 
all over again.

Segments can be merged (and then indexed) together.  The CrawlDb is 
global to all segments (although multiple crawldbs can be merged).  URLs 
in the CrawlDb that have been successfully fetched have a last-fetched 
time, and different FetchSchedule implementations determine the correct 
time to re-fetch those urls.  URLs that have not been fetched should be 
available for fetching immediately.  URLs whose fetch attempts have 
errored are retried on an increasing back-off scale that determines 
when the next fetch attempt should be.

Dennis
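
For reference, the config properties behind (2) and (3) above (the values shown are the nutch-default.xml defaults as far as I recall; double-check your version):

# conf/nutch-site.xml overrides, inside <configuration>:
#
#   <property>
#     <name>http.redirect.max</name>
#     <value>0</value>
#   </property>
#   0 (or a negative value) means redirects are not followed immediately but
#   are queued for a later fetch round, as described above.
#
#   <property>
#     <name>db.fetch.interval.default</name>
#     <value>2592000</value>
#   </property>
#   Seconds until a successfully fetched URL is due for refetch (30 days),
#   subject to the configured FetchSchedule implementation.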


Ian.huang wrote:
> hi, John
> 
> I am a newbie of nutch.
> 
> Can you tell me, How to deal with un-fetched url? If I run a recrawl 
> script, will un-fetched urls be handled? How about other fetched url? 
> Will them updated or refetch as well?
> 
> Is generate-fetch-update methodology means to run a new crawler and 
> merge with older one?
> 
> Thanks
> ian
> 
> --------------------------------------------------
> From: "John Martyniak" <jo...@beforedawn.com>
> Sent: Thursday, December 04, 2008 2:01 PM
> To: <nu...@lucene.apache.org>
> Subject: Re: Fetching only unfetched URLs
> 
>> I think that this would be another good piece of functionality.  As I 
>> would like to continue to use the generate-fetch-update methodology  
>> but would like to mimic the functionality of Crawl, in that I can 
>> grab  every page at a specific domain.
>>
>> -John
>>
>> On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
>>
>>>
>>>
>>> Otis Gospodnetic wrote:
>>>> Hi,
>>>> If there an existing method for generating a segment/fetchlist 
>>>> containing only URLs that have not yet been fetched?
>>>> I'm asking because I can imagine a situation where one has a large  
>>>> and "old" CrawlDb that "knows" about a lot of URLs (the ones with 
>>>> "db_unfetched" status if you run -stats) and in such a situation a 
>>>> person may prefer to fetch only the yet-unfetched URLs first, and  
>>>> only after that include URLs that need to be refetched in the newly 
>>>> generated segments.
>>>
>>> I don't think a current method exists to do only unfetched URLs, but  
>>> it does sound like an interesting bit of functionality.
>>>
>>>> One can write a custom Generator, or perhaps modify the existing  
>>>> one to add this option, but is there an existing mechanism for this?
>>>
>>> Generator would probably be best, let me look into what it would  
>>> take to do this.  Maybe we can get it into 1.0.
>>>
>>> Dennis
>>>
>>>> If not, does this sound like something that should be added to the 
>>>> existing Generator and invoked via a command-line arg, say - 
>>>> unfetchedOnly ?
>>>> Thanks,
>>>> Otis
>>>> -- 
>>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>

Re: Fetching only unfetched URLs

Posted by "Ian.huang" <yi...@hotmail.com>.
hi, John

I am a newbie of nutch.

Can you tell me how to deal with un-fetched URLs? If I run a recrawl script, 
will un-fetched URLs be handled? How about the other, already-fetched URLs? 
Will they be updated or refetched as well?

Does the generate-fetch-update methodology mean running a new crawl and 
merging it with the older one?

Thanks
ian

--------------------------------------------------
From: "John Martyniak" <jo...@beforedawn.com>
Sent: Thursday, December 04, 2008 2:01 PM
To: <nu...@lucene.apache.org>
Subject: Re: Fetching only unfetched URLs

> I think that this would be another good piece of functionality.  As I 
> would like to continue to use the generate-fetch-update methodology  but 
> would like to mimic the functionality of Crawl, in that I can grab  every 
> page at a specific domain.
>
> -John
>
> On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:
>
>>
>>
>> Otis Gospodnetic wrote:
>>> Hi,
>>> If there an existing method for generating a segment/fetchlist 
>>> containing only URLs that have not yet been fetched?
>>> I'm asking because I can imagine a situation where one has a large  and 
>>> "old" CrawlDb that "knows" about a lot of URLs (the ones with 
>>> "db_unfetched" status if you run -stats) and in such a situation a 
>>> person may prefer to fetch only the yet-unfetched URLs first, and  only 
>>> after that include URLs that need to be refetched in the newly 
>>> generated segments.
>>
>> I don't think a current method exists to do only unfetched URLs, but  it 
>> does sound like an interesting bit of functionality.
>>
>>> One can write a custom Generator, or perhaps modify the existing  one to 
>>> add this option, but is there an existing mechanism for this?
>>
>> Generator would probably be best, let me look into what it would  take to 
>> do this.  Maybe we can get it into 1.0.
>>
>> Dennis
>>
>>> If not, does this sound like something that should be added to the 
>>> existing Generator and invoked via a command-line arg, say - 
>>> unfetchedOnly ?
>>> Thanks,
>>> Otis
>>> --
>>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> 

Re: Fetching only unfetched URLs

Posted by John Martyniak <jo...@beforedawn.com>.
I think that this would be another good piece of functionality, as I  
would like to continue to use the generate-fetch-update methodology  
but would like to mimic the functionality of Crawl, in that I can grab  
every page at a specific domain.

-John

On Dec 4, 2008, at 8:40 AM, Dennis Kubes wrote:

>
>
> Otis Gospodnetic wrote:
>> Hi,
>> If there an existing method for generating a segment/fetchlist  
>> containing only URLs that have not yet been fetched?
>> I'm asking because I can imagine a situation where one has a large  
>> and "old" CrawlDb that "knows" about a lot of URLs (the ones with  
>> "db_unfetched" status if you run -stats) and in such a situation a  
>> person may prefer to fetch only the yet-unfetched URLs first, and  
>> only after that include URLs that need to be refetched in the newly  
>> generated segments.
>
> I don't think a current method exists to do only unfetched URLs, but  
> it does sound like an interesting bit of functionality.
>
>> One can write a custom Generator, or perhaps modify the existing  
>> one to add this option, but is there an existing mechanism for this?
>
> Generator would probably be best, let me look into what it would  
> take to do this.  Maybe we can get it into 1.0.
>
> Dennis
>
>> If not, does this sound like something that should be added to the  
>> existing Generator and invoked via a command-line arg, say - 
>> unfetchedOnly ?
>> Thanks,
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


Re: Fetching only unfetched URLs

Posted by Dennis Kubes <ku...@apache.org>.

Otis Gospodnetic wrote:
> Hi,
> 
> If there an existing method for generating a segment/fetchlist containing only URLs that have not yet been fetched?
> I'm asking because I can imagine a situation where one has a large and "old" CrawlDb that "knows" about a lot of URLs (the ones with "db_unfetched" status if you run -stats) and in such a situation a person may prefer to fetch only the yet-unfetched URLs first, and only after that include URLs that need to be refetched in the newly generated segments.
> 

I don't think a current method exists to do only unfetched URLs, but it 
does sound like an interesting bit of functionality.

> One can write a custom Generator, or perhaps modify the existing one to add this option, but is there an existing mechanism for this?

Generator would probably be best; let me look into what it would take to 
do this.  Maybe we can get it into 1.0.

Dennis

> 
> If not, does this sound like something that should be added to the existing Generator and invoked via a command-line arg, say -unfetchedOnly ?
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
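
Until something like that is in the Generator, a crude interim workaround sketch (the grep relies on an assumed CrawlDbReader dump layout -- the URL line shortly before its "Status: 1 (db_unfetched)" line -- so verify on a sample of the dump first):

# dump the CrawlDb as text
bin/nutch readdb crawl/crawldb -dump crawldb-dump

# pull out URLs still marked db_unfetched (check crawldb-dump/part-00000 to
# confirm the layout before trusting this)
mkdir -p unfetched-seeds
grep -h -B 2 "db_unfetched" crawldb-dump/part-* | grep "^http" | cut -f1 > unfetched-seeds/urls.txt

# these can then be injected into a separate CrawlDb and fetched on their own
bin/nutch inject crawl/crawldb-unfetched unfetched-seeds
bin/nutch generate crawl/crawldb-unfetched crawl/segments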

Re: Fetching only unfetched URLs

Posted by ianwong <yi...@hotmail.com>.
Hi,

I have the same requirement. Can anybody offer a solution?
Do I need to recrawl?

thanks
Ian

------------------------------------------
Hi,

Is there an existing method for generating a segment/fetchlist containing
only URLs that have not yet been fetched?
I'm asking because I can imagine a situation where one has a large and "old"
CrawlDb that "knows" about a lot of URLs (the ones with "db_unfetched"
status if you run -stats) and in such a situation a person may prefer to
fetch only the yet-unfetched URLs first, and only after that include URLs
that need to be refetched in the newly generated segments.

One can write a custom Generator, or perhaps modify the existing one to add
this option, but is there an existing mechanism for this?

If not, does this sound like something that should be added to the existing
Generator and invoked via a command-line arg, say -unfetchedOnly ?

Thanks,
Otis
--

-- 
View this message in context: http://www.nabble.com/Fetching-only-unfetched-URLs-tp18058588p20831431.html
Sent from the Nutch - User mailing list archive at Nabble.com.