Posted to user@nutch.apache.org by Olive g <ol...@hotmail.com> on 2006/03/08 17:49:26 UTC
help - distributed crawl in 0.7.1
Hi, I am new here.
Could someone please give me step-by-step instructions for setting up a
distributed crawl in 0.7.1?
Thank you.
Re: help - distributed crawl in 0.7.1
Posted by TDLN <di...@gmail.com>.
You can start here: http://wiki.apache.org/nutch/NutchDistributedFileSystem
Also, I think several posts on the mailing list contain such a
step-by-step overview.
Rgrds, Thomas
On 3/8/06, Olive g <ol...@hotmail.com> wrote:
>
> Hi I am new here.
> Could someone please let me know the step-by-step instructions to set up
> distributed crawl in 0.7.1?
> Thank you.
Re[2]: help - distributed crawl in 0.7.1
Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Olive.
It (0.7.1) is more stable.
I spent a week learning the concepts behind 0.8.
But unfortunately I rolled back to version 0.7.1.
The only thing I needed from 0.8 was the SWF parser.
> Thank you so much for your reply!
> I just sent another message because I am having other issues with 0.8:
> somehow the TOTAL urls is always 1 when I search big sites such as
> www.yahoo.com. I thought 0.7.1 might be more stable?
> The stats:
> 060308 064418 Client connection to 9.2.13.8:8010 : starting
> 060308 064418 Client connection to 9.2.13.8:8009: starting
> 060308 064418 parsing file:/root/nutch/conf/nutch-default.xml
> 060308 064418 parsing file:/root/nutch/conf/nutch-site.xml
> 060308 064419 Running job: job_ljydgp
> 060308 064420 map 0%
> 060308 064427 map 100%
> 060308 064433 reduce 100%
> 060308 064433 Job complete: job_ljydgp
> 060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
> 060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
> 060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
> 060308 064436 TOTAL urls: 1
> 060308 064436 avg score: 1.0
> 060308 064436 max score: 1.0
> 060308 064436 min score: 1.0
> 060308 064436 retry 0: 1
> 060308 064436 status 2 (DB_fetched): 1
> 060308 064437 CrawlDb statistics: done
>>From: Stefan Groschupf <sg...@media-style.com>
>>Reply-To: nutch-user@lucene.apache.org
>>To: nutch-user@lucene.apache.org
>>Subject: Re: help - distributed crawl in 0.7.1
>>Date: Wed, 8 Mar 2006 17:51:11 +0100
>>
>>You would be better off using Nutch 0.8 to run a crawl on several machines.
>>There is some documentation in the wiki now.
--
Regards,
Dima mailto:nuther@proservice.ge
Re: help - distributed crawl in 0.7.1
Posted by Olive g <ol...@hotmail.com>.
Thanks! I saw that one too, but according to Doug, it was for 0.8 only. Does
anyone have step-by-step instructions like the ones for 0.8?
Also, does anyone know why the URL total is always 1 when I run 0.8?
060308 064420 map 0%
060308 064427 map 100%
060308 064433 reduce 100%
060308 064433 Job complete: job_ljydgp
060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
060308 064436 TOTAL urls: 1
060308 064436 avg score: 1.0
060308 064436 max score: 1.0
060308 064436 min score: 1.0
060308 064436 retry 0: 1
060308 064436 status 2 (DB_fetched): 1
060308 064437 CrawlDb statistics: done
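For anyone who wants to check these numbers in a script rather than by eye, here is a minimal sketch. It assumes the log format is exactly as pasted above (two timestamp fields, then "key: value"). The function name parse_crawldb_stats is hypothetical and not part of Nutch; it only reads text that the stats command has already printed.

```python
import re

# Assumed line shape, taken from the log above:
#   "<yymmdd> <hhmmss> <key>: <value>"
# Lines without a "key: value" pair (e.g. "map 100%") are skipped.
LINE = re.compile(r"^\d{6} \d{6} (?P<key>[^:]+):\s*(?P<value>\S+)\s*$")


def parse_crawldb_stats(log_text):
    """Collect 'key: value' pairs from pasted CrawlDb statistics output."""
    stats = {}
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            stats[match.group("key").strip()] = match.group("value")
    return stats


sample = """\
060308 064436 TOTAL urls: 1
060308 064436 avg score: 1.0
060308 064436 max score: 1.0
060308 064436 min score: 1.0
060308 064436 retry 0: 1
060308 064436 status 2 (DB_fetched): 1
060308 064437 CrawlDb statistics: done
"""

stats = parse_crawldb_stats(sample)
if int(stats["TOTAL urls"]) <= 1:
    print("CrawlDb holds only the seed URL; no new links were added.")
```

A TOTAL of 1 together with one DB_fetched entry suggests that only the seed URL is in the CrawlDb, which is the symptom described above. The sketch does not explain the cause.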
>From: TDLN <di...@gmail.com>
>Reply-To: nutch-user@lucene.apache.org
>To: nutch-user@lucene.apache.org
>Subject: Re: help - distributed crawl in 0.7.1
>Date: Wed, 8 Mar 2006 18:00:06 +0100
>
>Detailed distributed crawl implementation:
>
>http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02270.html
>
>I am not sure it applies to 0.7 though, but it has a lot of info.
>
>Rgrds, Thomas
Re: help - distributed crawl in 0.7.1
Posted by TDLN <di...@gmail.com>.
Detailed distributed crawl implementation:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg02270.html
I am not sure it applies to 0.7, but it has a lot of info.
Rgrds, Thomas
Re[4]: help - distributed crawl in 0.7.1
Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Stefan.
Strange, I found it more complicated.
Never mind, it's just my point of view :)
You wrote on 8 March 2006, 21:11:09:
> I personally found the very latest source to be the most stable and
> easiest-to-use Nutch version I have ever used.
> Just my point of view.
> A lot of map reduce issues are fixed now. If distributed means running on
> several machines, I suggest 0.8.
--
Regards,
Dima mailto:nuther@proservice.ge
Re: Re[2]: help - distributed crawl in 0.7.1
Posted by Stefan Groschupf <sg...@media-style.com>.
I personally found the very latest source to be the most stable and
easiest-to-use Nutch version I have ever used.
Just my point of view.
A lot of map reduce issues are fixed now. If distributed means running on
several machines, I suggest 0.8.
On 08.03.2006 at 19:03, Dima Mazmanov wrote:
> Hi,Stefan.
>
> I don't think so. 0.8 is more complicated.
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Re[2]: help - distributed crawl in 0.7.1
Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Stefan.
I don't think so. 0.8 is more complicated.
> You would be better off using Nutch 0.8 to run a crawl on several machines.
> There is some documentation in the wiki now.
--
Regards,
Dima mailto:nuther@proservice.ge
Re: help - distributed crawl in 0.7.1
Posted by Olive g <ol...@hotmail.com>.
Thank you so much for your reply!
I just sent another message because I am having other issues with 0.8:
somehow the TOTAL urls is always 1 when I search big sites such as
www.yahoo.com. I thought 0.7.1 might be more stable?
The stats:
060308 064418 Client connection to 9.2.13.8:8010 : starting
060308 064418 Client connection to 9.2.13.8:8009: starting
060308 064418 parsing file:/root/nutch/conf/nutch-default.xml
060308 064418 parsing file:/root/nutch/conf/nutch-site.xml
060308 064419 Running job: job_ljydgp
060308 064420 map 0%
060308 064427 map 100%
060308 064433 reduce 100%
060308 064433 Job complete: job_ljydgp
060308 064434 parsing file:/root/nutch/conf/nutch-default.xml
060308 064434 parsing file:/root/nutch/conf/nutch-site.xml
060308 064436 Statistics for CrawlDb: /user/root/crawl-20060307224144/crawldb
060308 064436 TOTAL urls: 1
060308 064436 avg score: 1.0
060308 064436 max score: 1.0
060308 064436 min score: 1.0
060308 064436 retry 0: 1
060308 064436 status 2 (DB_fetched): 1
060308 064437 CrawlDb statistics: done
>From: Stefan Groschupf <sg...@media-style.com>
>Reply-To: nutch-user@lucene.apache.org
>To: nutch-user@lucene.apache.org
>Subject: Re: help - distributed crawl in 0.7.1
>Date: Wed, 8 Mar 2006 17:51:11 +0100
>
>You would be better off using Nutch 0.8 to run a crawl on several machines.
>There is some documentation in the wiki now.
Re: help - distributed crawl in 0.7.1
Posted by Stefan Groschupf <sg...@media-style.com>.
You would be better off using Nutch 0.8 to run a crawl on several machines.
There is some documentation in the wiki now.
On 08.03.2006 at 17:49, Olive g wrote:
> Hi I am new here.
> Could someone please let me know the step-by-step instructions to
> set up
> distributed crawl in 0.7.1?
> Thank you.
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Re: help - distributed crawl in 0.7.1
Posted by Dima Mazmanov <nu...@proservice.ge>.
Hi, Olive.
Use www.nutch.org
Though the tutorial is for 0.7, you can apply it to version 0.7.1.
If you have a more specific question, ask :)
You wrote on 8 March 2006, 20:49:26:
> Hi I am new here.
> Could someone please let me know the step-by-step instructions to set up
> distributed crawl in 0.7.1?
> Thank you.
--
Regards,
Dima mailto:nuther@proservice.ge