Posted to user@nutch.apache.org by Manoj Bist <ma...@gmail.com> on 2008/01/13 04:06:59 UTC

'crawled already exists' - how do I recrawl?

Hi,

When I run a crawl the second time, it always complains that 'crawled' already
exists, and I always have to remove this directory with 'hadoop dfs -rm
crawled' to get going.
Is there some way to avoid this error and tell Nutch that it's a recrawl?

bin/nutch crawl urls -dir crawled -depth 1  2>&1 | tee /tmp/foo.log


Exception in thread "main" java.lang.RuntimeException: crawled already
exists.
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)

Thanks,

Manoj.

-- 
Tired of reading blogs? Listen to  your favorite blogs at
http://www.blogbard.com   !!!!

Re: 'crawled already exists' - how do I recrawl?

Posted by Susam Pal <su...@gmail.com>.
The script creates a 'crawl' directory in the present working directory.

Where is your Nutch directory, and from where are you running the script?
I usually change to the top-level Nutch directory, put the script in the
'bin' directory, chmod a+x bin/crawl, and then run it as bin/crawl. With
this setup the crawl_generate directory should be created in
crawl/segments/<segment-number>/crawl_generate (a typical
segment-number: 20080102215525).
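
After a successful generate step the tree should therefore look roughly
like this (a sketch; crawldb is created by the inject and update steps,
and more crawl_* subdirectories appear once fetching and parsing have
run):

  crawl/
      crawldb/
      segments/
          20080102215525/
              crawl_generate/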

Your error seems to come from this statement in the script:

$NUTCH_HOME/bin/nutch fetch $segment -threads $threads

The fetcher tries to access $segment/crawl_generate right at the start. In
your case it is trying to open:
/user/nutch/-threads/crawl_generate

So it seems the above statement resolved to:

$NUTCH_HOME/bin/nutch fetch /user/nutch/-threads $threads

This means your $segment is /user/nutch and the space between $segment and
-threads is missing. Have you modified the script or altered the paths and
accidentally dropped the space?
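
To make the failure mode concrete, the relevant part of the script looks
roughly like this (a sketch, not a verbatim copy of the wiki script):

  # Pick up the newest segment produced by the generate step.
  $NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`

  # Whether $segment expanded to /user/nutch/ with the space lost, or to
  # nothing at all (so that "-threads" itself was taken as a relative
  # segment path and resolved against the DFS home directory), the result
  # is the same: the fetcher looks for /user/nutch/-threads/crawl_generate.
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads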

I hope this information and the script help you resolve the problem.
Whatever the result, please let us know; that would help us improve the
script if needed.

Regards,
Susam Pal

On Jan 13, 2008 11:19 AM, Manoj Bist <ma...@gmail.com> wrote:
> Thanks for the response.
> I tried this with nutch-0.9. The script seems to be accessing non-existent
> files/directories.
>
> Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt
> exist : /user/nutch/-threads/crawl_generate
>         at org.apache.hadoop.mapred.InputFormatBase.validateInput(
> InputFormatBase.java:138)
>         at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
>         at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
>         at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
>         at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>         at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)
>
>
>
>
> On Jan 12, 2008 9:00 PM, Susam Pal <su...@gmail.com> wrote:
>
> > You can try the crawl script: http://wiki.apache.org/nutch/Crawl
> >
> > Regards,
> > Susam Pal
> >
> > On Jan 13, 2008 8:36 AM, Manoj Bist <ma...@gmail.com> wrote:
> > > Hi,
> > >
> > > When I run crawl the second time, it always complains that 'crawled'
> > already
> > > exists. I always need to remove this directory using 'hadoop dfs -rm
> > > crawled' to get going.
> > > Is there some way to avoid this error and tell Nutch that it's a recrawl?
> > >
> > > bin/nutch crawl urls -dir crawled -depth 1  2>&1 | tee /tmp/foo.log
> > >
> > >
> > > Exception in thread "main" java.lang.RuntimeException: crawled already
> > > exists.
> > >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
> > >
> > > Thanks,
> > >
> > > Manoj.
> > >
> > > --
> > > Tired of reading blogs? Listen to  your favorite blogs at
> > > http://www.blogbard.com   !!!!
> > >
> >
>
>
>
> --
>
> Tired of reading blogs? Listen to  your favorite blogs at
> http://www.blogbard.com   !!!!
>

Re: 'crawled already exists' - how do I recrawl?

Posted by Manoj Bist <ma...@gmail.com>.
Thanks for the response.
I tried this with nutch-0.9. The script seems to be accessing non-existent
files/directories.

Fetcher: org.apache.hadoop.mapred.InvalidInputException: Input path doesnt
exist : /user/nutch/-threads/crawl_generate
        at org.apache.hadoop.mapred.InputFormatBase.validateInput(
InputFormatBase.java:138)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:326)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:543)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:470)
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:505)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:477)



On Jan 12, 2008 9:00 PM, Susam Pal <su...@gmail.com> wrote:

> You can try the crawl script: http://wiki.apache.org/nutch/Crawl
>
> Regards,
> Susam Pal
>
> On Jan 13, 2008 8:36 AM, Manoj Bist <ma...@gmail.com> wrote:
> > Hi,
> >
> > When I run crawl the second time, it always complains that 'crawled'
> already
> > exists. I always need to remove this directory using 'hadoop dfs -rm
> > crawled' to get going.
> > Is there some way to avoid this error and tell Nutch that it's a recrawl?
> >
> > bin/nutch crawl urls -dir crawled -depth 1  2>&1 | tee /tmp/foo.log
> >
> >
> > Exception in thread "main" java.lang.RuntimeException: crawled already
> > exists.
> >         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
> >
> > Thanks,
> >
> > Manoj.
> >
> > --
> > Tired of reading blogs? Listen to  your favorite blogs at
> > http://www.blogbard.com   !!!!
> >
>



-- 
Tired of reading blogs? Listen to  your favorite blogs at
http://www.blogbard.com   !!!!

Re: 'crawled already exists' - how do I recrawl?

Posted by nghianghesi <ng...@yahoo.com>.
The script works well (Nutch 0.9).
However, I have some concerns:
Judging from the on-screen log and a review of the code, the script
re-indexes the whole database, which is slow (it takes as long as indexing
from scratch).
--> Is there any way to re-index only the pages that changed?
The generate step is also long.
--> Can it be improved?
The db.default.fetch.interval setting applies to all pages.
--> Is there any way to make it adaptive? I mean, some pages, such as the
home page of a news site, need to be re-indexed every day.
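
For reference, this is the setting I mean, as shipped in
conf/nutch-default.xml (the value is in days and, as far as I can tell,
applies to every page equally):

  <property>
    <name>db.default.fetch.interval</name>
    <value>30</value>
  </property>

What I am looking for is a way to give individual pages a shorter
interval than this global default.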

Thanks
Nghia Nguyen.


Susam Pal wrote:
> 
> You can try the crawl script: http://wiki.apache.org/nutch/Crawl
> 
> Regards,
> Susam Pal
> 
> On Jan 13, 2008 8:36 AM, Manoj Bist <ma...@gmail.com> wrote:
>> Hi,
>>
>> When I run crawl the second time, it always complains that 'crawled'
>> already
>> exists. I always need to remove this directory using 'hadoop dfs -rm
>> crawled' to get going.
>> Is there some way to avoid this error and tell Nutch that it's a recrawl?
>>
>> bin/nutch crawl urls -dir crawled -depth 1  2>&1 | tee /tmp/foo.log
>>
>>
>> Exception in thread "main" java.lang.RuntimeException: crawled already
>> exists.
>>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
>>
>> Thanks,
>>
>> Manoj.
>>
>> --
>> Tired of reading blogs? Listen to  your favorite blogs at
>> http://www.blogbard.com   !!!!
>>
> 
> 



Re: 'crawled already exists' - how do I recrawl?

Posted by Susam Pal <su...@gmail.com>.
You can try the crawl script: http://wiki.apache.org/nutch/Crawl
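
In essence, the script replaces the one-shot 'bin/nutch crawl' with the
individual tools, which update an existing crawldb instead of refusing to
run when the output directory is already there. The cycle it automates
looks roughly like this (a sketch: the depth of 3 and the 10 threads are
illustrative values, and on DFS the segment listing needs 'hadoop dfs -ls'
instead of a local ls):

  # Seed the crawl database with the start URLs (first run only).
  bin/nutch inject crawl/crawldb urls

  # One generate/fetch/update pass per depth level.
  for pass in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments
      segment=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch $segment -threads 10
      bin/nutch updatedb crawl/crawldb $segment
  done

  # Rebuild the link database and the index over all segments.
  bin/nutch invertlinks crawl/linkdb crawl/segments/*
  bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*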

Regards,
Susam Pal

On Jan 13, 2008 8:36 AM, Manoj Bist <ma...@gmail.com> wrote:
> Hi,
>
> When I run crawl the second time, it always complains that 'crawled' already
> exists. I always need to remove this directory using 'hadoop dfs -rm
> crawled' to get going.
> Is there some way to avoid this error and tell Nutch that it's a recrawl?
>
> bin/nutch crawl urls -dir crawled -depth 1  2>&1 | tee /tmp/foo.log
>
>
> Exception in thread "main" java.lang.RuntimeException: crawled already
> exists.
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:85)
>
> Thanks,
>
> Manoj.
>
> --
> Tired of reading blogs? Listen to  your favorite blogs at
> http://www.blogbard.com   !!!!
>