Posted to user@nutch.apache.org by 高睿 <ga...@163.com> on 2013/02/17 15:11:22 UTC

fetch/parse twice?

Hi,

There's only 1 url in the 'webpage' table. I run the command bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000, and then I find that the url is crawled twice.

Here's the log:
 55 2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
 84 2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
215 2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
244 2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm

Do you know how to fix this?
Besides, when I run the command again, the same log lines are written to hadoop.log. I don't understand why the 'db.fetch.interval.default' setting in nutch-site.xml doesn't take effect.

Thanks.

Regards,
Rui

Re:Re: Re: fetch/parse twice?

Posted by 高睿 <ga...@163.com>.
Hi,

I have the following configuration in nutch-site.xml:
        <property>
                <name>db.fetch.interval.default</name>
                <value>2592000</value>
                <description>The default number of seconds between re-fetches of a page (30 days).
                </description>
        </property>
       
</configuration>

So, I guess the fetch interval is configured correctly. Still, I don't know why this configuration doesn't take effect.
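
For reference, the stored row for that url can be dumped to see the effective fetchInterval and fetchTime, assuming the Nutch 2.x WebTableReader is exposed as the readdb command (the exact flags may differ between 2.x releases):

        bin/nutch readdb -url http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm

If fetchInterval is not 2592000 in that dump, the job is probably not reading this nutch-site.xml at all (for example, when running from runtime/deploy the job jar generally has to be rebuilt before configuration changes take effect).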







At 2013-02-18 10:16:47,"feng lu" <am...@gmail.com> wrote:
>Hi,
>
>May be that url has generated three times. One reason is that the url
>is reach the fetch time, so it will generate again. check your
>fetchInterval is set correctly.  Another reason is that the fetcher
>Markers doesn't remove the marker from the database, current marker is
>still GENERATE_MARK.
>
>You can run nutch comment step by step (generate->fetch-dbupdate) to
>see what happens.
>
>On 2/18/13, 高睿 <ga...@163.com> wrote:
>> Hi,
>>
>> What do you mean the same directory? '/tmp' or '${NUTCH_HOME}'?
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2013-02-18 00:45:00,"Lewis John Mcgibbney" <le...@gmail.com>
>> wrote:
>>>Hi,
>>>Please make sure you have no temp files in the same directory and try again
>>>Please either use the crawl script which is provided with nutch or
>>>alternatively build your own script.
>>>
>>>
>>>On Sunday, February 17, 2013, 高睿 <ga...@163.com> wrote:
>>>> Hi,
>>>> Additional, the nutch version is 2.1. And I have an ParserFilter to purge
>>>outlinks of parse object. (by code: parse.setOutlinks(new Outlink[] {});)
>>>>
>>>> When I specify '-depth 1', the url is only crawled once, and If I specify
>>>'-depth 3', the url is crawled 3 times.
>>>> Is this expected behavior? Should I use command 'crawl' to do all works
>>>in one go?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> At 2013-02-17 22:11:22,"高睿" <ga...@163.com> wrote:
>>>>>Hi,
>>>>>
>>>>>There's only 1 url in table 'webpage'. I run command: bin/nutch crawl
>>>-solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN
>>>10000, then I find the url is crawled twice.
>>>>>
>>>>>Here's the log:
>>>>> 55 2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>> 84 2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>215 2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>244 2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>
>>>>>Do you know how to fix this?
>>>>>Besides, when I run the command again. The same log is written in
>>>hadoop.log. I don't know why the configuration 'db.fetch.interval.default'
>>>in nutch-site.xml doesn't take effect.
>>>>>
>>>>>Thanks.
>>>>>
>>>>>Regards,
>>>>>Rui
>>>>
>>>
>>>--
>>>*Lewis*
>>
>
>
>-- 
>Don't Grow Old, Grow Up... :-)

Re: Re: fetch/parse twice?

Posted by feng lu <am...@gmail.com>.
Hi,

Maybe that url has been generated multiple times. One reason is that the url has reached its fetch time, so it gets generated again; check that your fetchInterval is set correctly. Another reason is that the fetcher doesn't remove the marker from the database, so the current marker is still GENERATE_MARK.

You can run the nutch commands step by step (generate -> fetch -> updatedb) to see what happens.
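
For example, a single cycle run by hand might look like the following (assuming the standard Nutch 2.x sub-commands; 'urls' here stands for whatever seed directory you use, and exact flags can differ between 2.x releases):

        bin/nutch inject urls
        bin/nutch generate -topN 10000
        bin/nutch fetch -all
        bin/nutch parse -all
        bin/nutch updatedb

After updatedb, the batch markers for the fetched url should have been cleared; if, for example, GENERATE_MARK is still set on the row, the next generate can pick the url up again.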

-- 
Don't Grow Old, Grow Up... :-)

Re:Re: fetch/parse twice?

Posted by 高睿 <ga...@163.com>.
The urls dir is not specified in the command.

bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000







在 2013-02-18 09:53:33,"Lewis John Mcgibbney" <le...@gmail.com> 写道:
>Wherever your url directory is kept
>
>On Sunday, February 17, 2013, 高睿 <ga...@163.com> wrote:
>> Hi,
>>
>> What do you mean the same directory? '/tmp' or '${NUTCH_HOME}'?
>>
>>
>>
>>
>>
>>
>>
>>
>> At 2013-02-18 00:45:00,"Lewis John Mcgibbney" <le...@gmail.com>
>wrote:
>>>Hi,
>>>Please make sure you have no temp files in the same directory and try
>again
>>>Please either use the crawl script which is provided with nutch or
>>>alternatively build your own script.
>>>
>>>
>>>On Sunday, February 17, 2013, 高睿 <ga...@163.com> wrote:
>>>> Hi,
>>>> Additional, the nutch version is 2.1. And I have an ParserFilter to
>purge
>>>outlinks of parse object. (by code: parse.setOutlinks(new Outlink[] {});)
>>>>
>>>> When I specify '-depth 1', the url is only crawled once, and If I
>specify
>>>'-depth 3', the url is crawled 3 times.
>>>> Is this expected behavior? Should I use command 'crawl' to do all works
>>>in one go?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> At 2013-02-17 22:11:22,"高睿" <ga...@163.com> wrote:
>>>>>Hi,
>>>>>
>>>>>There's only 1 url in table 'webpage'. I run command: bin/nutch crawl
>>>-solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN
>>>10000, then I find the url is crawled twice.
>>>>>
>>>>>Here's the log:
>>>>> 55 2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>> 84 2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>215 2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>244 2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing
>>>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>>>
>>>>>Do you know how to fix this?
>>>>>Besides, when I run the command again. The same log is written in
>>>hadoop.log. I don't know why the configuration 'db.fetch.interval.default'
>>>in nutch-site.xml doesn't take effect.
>>>>>
>>>>>Thanks.
>>>>>
>>>>>Regards,
>>>>>Rui
>>>>
>>>
>>>--
>>>*Lewis*
>>
>
>-- 
>*Lewis*

Re: fetch/parse twice?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Wherever your url directory is kept


-- 
*Lewis*

Re:Re: fetch/parse twice?

Posted by 高睿 <ga...@163.com>.
Hi,

What do you mean by 'the same directory'? '/tmp' or '${NUTCH_HOME}'?








At 2013-02-18 00:45:00,"Lewis John Mcgibbney" <le...@gmail.com> wrote:
>Hi,
>Please make sure you have no temp files in the same directory and try again
>Please either use the crawl script which is provided with nutch or
>alternatively build your own script.
>
>
>On Sunday, February 17, 2013, 高睿 <ga...@163.com> wrote:
>> Hi,
>> Additional, the nutch version is 2.1. And I have an ParserFilter to purge
>outlinks of parse object. (by code: parse.setOutlinks(new Outlink[] {});)
>>
>> When I specify '-depth 1', the url is only crawled once, and If I specify
>'-depth 3', the url is crawled 3 times.
>> Is this expected behavior? Should I use command 'crawl' to do all works
>in one go?
>>
>>
>>
>>
>>
>>
>>
>> At 2013-02-17 22:11:22,"高睿" <ga...@163.com> wrote:
>>>Hi,
>>>
>>>There's only 1 url in table 'webpage'. I run command: bin/nutch crawl
>-solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN
>10000, then I find the url is crawled twice.
>>>
>>>Here's the log:
>>> 55 2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching
>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>> 84 2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing
>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>215 2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching
>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>244 2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing
>http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>>>
>>>Do you know how to fix this?
>>>Besides, when I run the command again. The same log is written in
>hadoop.log. I don't know why the configuration 'db.fetch.interval.default'
>in nutch-site.xml doesn't take effect.
>>>
>>>Thanks.
>>>
>>>Regards,
>>>Rui
>>
>
>-- 
>*Lewis*

Re: fetch/parse twice?

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi,
Please make sure you have no temp files in the same directory and try again.
Please either use the crawl script that is provided with Nutch, or alternatively build your own script.



-- 
*Lewis*

Re:fetch/parse twice?

Posted by 高睿 <ga...@163.com>.
Hi,

Additionally, the nutch version is 2.1, and I have a ParseFilter that purges the outlinks of the parse object (via: parse.setOutlinks(new Outlink[] {});).

When I specify '-depth 1', the url is only crawled once, and if I specify '-depth 3', the url is crawled 3 times.
Is this expected behavior? Should I use the 'crawl' command to do all the work in one go?
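
For reference, the filter is roughly of this shape (a minimal sketch against the Nutch 2.x ParseFilter extension point; the class and package names here are made up, and interface details may differ slightly between 2.x releases):

        package com.example.nutch;                        // hypothetical package/plugin name

        import java.util.Collection;
        import java.util.Collections;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.nutch.parse.HTMLMetaTags;
        import org.apache.nutch.parse.Outlink;
        import org.apache.nutch.parse.Parse;
        import org.apache.nutch.parse.ParseFilter;
        import org.apache.nutch.storage.WebPage;
        import org.w3c.dom.DocumentFragment;

        /** Drops every extracted outlink so no new urls reach the webpage table. */
        public class PurgeOutlinksFilter implements ParseFilter {

          private Configuration conf;

          public Parse filter(String url, WebPage page, Parse parse,
              HTMLMetaTags metaTags, DocumentFragment doc) {
            parse.setOutlinks(new Outlink[] {});          // discard whatever the parser found
            return parse;
          }

          public Collection<WebPage.Field> getFields() {
            return Collections.<WebPage.Field>emptySet(); // no extra columns needed from the store
          }

          public Configuration getConf() {
            return conf;
          }

          public void setConf(Configuration conf) {
            this.conf = conf;
          }
        }

The plugin is wired up through the usual plugin.xml and listed in plugin.includes, so it runs for every parsed page.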







At 2013-02-17 22:11:22,"高睿" <ga...@163.com> wrote:
>Hi,
>
>There's only 1 url in table 'webpage'. I run command: bin/nutch crawl -solr http://localhost:8080/solr/collection2 -threads 10 -depth 2 -topN 10000, then I find the url is crawled twice.
>
>Here's the log:
> 55 2013-02-17 20:45:00,965 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
> 84 2013-02-17 20:45:11,021 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>215 2013-02-17 20:45:38,922 INFO  fetcher.FetcherJob - fetching http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>244 2013-02-17 20:45:46,031 INFO  parse.ParserJob - Parsing http://www.p5w.net/stock/lzft/gsyj/201209/t4470475.htm
>
>Do you know how to fix this?
>Besides, when I run the command again. The same log is written in hadoop.log. I don't know why the configuration 'db.fetch.interval.default' in nutch-site.xml doesn't take effect.
>
>Thanks.
>
>Regards,
>Rui