Posted to user@nutch.apache.org by Olive g <ol...@hotmail.com> on 2006/03/08 17:49:26 UTC

help - distributed crawl in 0.7.1

Hi, I am new here.
Could someone please let me know the step-by-step instructions to set up
a distributed crawl in 0.7.1?
Thank you.



Re: help - distributed crawl in 0.7.1

Posted by TDLN <di...@gmail.com>.
You can start here http://wiki.apache.org/nutch/NutchDistributedFileSystem

Also, I think there have been several posts in the mailing list that contain
such a step-by-step overview.

Rgrds, Thomas

On 3/8/06, Olive g <ol...@hotmail.com> wrote:
>
> Hi I am new here.
> Could someone please let me know the step-by-step instructions to set up
> distributed crawl in 0.7.1?
> Thank you.

Re[2]: help - distributed crawl in 0.7.1

Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Olive.

Yes, 0.7.1 is more stable.
I spent one week learning the concepts behind 0.8,
but unfortunately rolled back to version 0.7.1.
The only thing I needed from 0.8 was the SWF parser.


> Thank you so much for your reply!
> I just sent another message, because I am having other issues with 0.8,
> and somehow the TOTAL urls is always 1 when I search big sites such as
> www.yahoo.com. I thought 0.7.1 might be more stable?

> The stats:
> 060308 064418 Client connection to 9.2.13.8:8010 : starting
> 060308 064418 Client connection to 9.2.13.8:8009: starting
> 060308 064418 parsing file:/root/nutch/conf/nutch-default.xml
> 060308 064418 parsing file:/root/nutch/conf/nutch-site.xml
> 060308 064419 Running job: job_ljydgp
> 060308 064420  map 0%
> 060308 064427  map 100%
> 060308 064433  reduce 100%
> 060308 064433 Job complete: job_ljydgp
> 060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
> 060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
> 060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
> 060308 064436 TOTAL urls:       1
> 060308 064436 avg score:        1.0
> 060308 064436 max score:        1.0
> 060308 064436 min score:        1.0
> 060308 064436 retry 0:  1
> 060308 064436 status 2 (DB_fetched):    1
> 060308 064437 CrawlDb statistics: done





-- 
Regards,
 Dima                          mailto:nuther@proservice.ge


Re: help - distributed crawl in 0.7.1

Posted by Olive g <ol...@hotmail.com>.
Thanks! I saw that one too, but according to Doug, it was for 0.8 only.
Does anyone have step-by-step instructions for 0.7.1 like the one for 0.8?
Also, does anyone know why the URL total is always 1 when I run 0.8?
060308 064420  map 0%
060308 064427  map 100%
060308 064433  reduce 100%
060308 064433 Job complete: job_ljydgp
060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
060308 064436 TOTAL urls:       1
060308 064436 avg score:        1.0
060308 064436 max score:        1.0
060308 064436 min score:        1.0
060308 064436 retry 0:  1
060308 064436 status 2 (DB_fetched):    1
060308 064437 CrawlDb statistics: done


>From: TDLN <di...@gmail.com>
>Date: Wed, 8 Mar 2006 18:00:06 +0100
>
>Detailed distributed crawl implementation:
>
>http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02270.html
>
>I am not sure it applies to 0.7, but it has a lot of info.
>
>Rgrds, Thomas



Re: help - distributed crawl in 0.7.1

Posted by TDLN <di...@gmail.com>.
Detailed distributed crawl implementation:

http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02270.html

I am not sure it applies to 0.7, but it has a lot of info.

Rgrds, Thomas

Re[4]: help - distributed crawl in 0.7.1

Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Stefan.

Strange, I found it more complicated.
Never mind, it's just my point of view :)
You wrote on 8 March 2006, 21:11:09:

> I personally found the very latest source to be the most stable and
> easiest-to-use Nutch version I have ever used.
> Just my point of view.
> A lot of MapReduce issues are fixed now; if distributed means running on
> several machines, I suggest 0.8.

-- 
Regards,
 Dima                          mailto:nuther@proservice.ge


Re: Re[2]: help - distributed crawl in 0.7.1

Posted by Stefan Groschupf <sg...@media-style.com>.
I personally found the very latest source to be the most stable and
easiest-to-use Nutch version I have ever used.
Just my point of view.
A lot of MapReduce issues are fixed now; if distributed means running on
several machines, I suggest 0.8.

On 08.03.2006 at 19:03, Dima Mazmanov wrote:

> Hi, Stefan.
>
> I don't think so. 0.8 is more complicated.

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re[2]: help - distributed crawl in 0.7.1

Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Stefan.

I don't think so. 0.8 is more complicated.


> You are better off using Nutch 0.8 to run a crawl on several machines.
> There is some documentation in the wiki now.

-- 
Regards,
 Dima                          mailto:nuther@proservice.ge


Re: help - distributed crawl in 0.7.1

Posted by Olive g <ol...@hotmail.com>.
Thank you so much for your reply!
I just sent another message, because I am having other issues with 0.8,
and somehow the TOTAL urls is always 1 when I search big sites such as
www.yahoo.com. I thought 0.7.1 might be more stable?

The stats:
060308 064418 Client connection to 9.2.13.8:8010 : starting
060308 064418 Client connection to 9.2.13.8:8009: starting
060308 064418 parsing file:/root/nutch/conf/nutch-default.xml
060308 064418 parsing file:/root/nutch/conf/nutch-site.xml
060308 064419 Running job: job_ljydgp
060308 064420  map 0%
060308 064427  map 100%
060308 064433  reduce 100%
060308 064433 Job complete: job_ljydgp
060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
060308 064436 TOTAL urls:       1
060308 064436 avg score:        1.0
060308 064436 max score:        1.0
060308 064436 min score:        1.0
060308 064436 retry 0:  1
060308 064436 status 2 (DB_fetched):    1
060308 064437 CrawlDb statistics: done





>From: Stefan Groschupf <sg...@media-style.com>
>Date: Wed, 8 Mar 2006 17:51:11 +0100
>
>You are better off using Nutch 0.8 to run a crawl on several machines.
>There is some documentation in the wiki now.


Re: help - distributed crawl in 0.7.1

Posted by Stefan Groschupf <sg...@media-style.com>.
You are better off using Nutch 0.8 to run a crawl on several machines.
There is some documentation in the wiki now.

On 08.03.2006 at 17:49, Olive g wrote:

> Hi I am new here.
> Could someone please let me know the step-by-step instructions to set up
> distributed crawl in 0.7.1?
> Thank you.

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: help - distributed crawl in 0.7.1

Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Olive.

Use www.nutch.org.
Though the tutorial is for 0.7, you can apply it to version 0.7.1.
If you have a more specific question, ask :)
You wrote on 8 March 2006, 20:49:26:

> Hi I am new here.
> Could someone please let me know the step-by-step instructions to set up
> distributed crawl in 0.7.1?
> Thank you.

-- 
Regards,
 Dima                          mailto:nuther@proservice.ge