Posted to user@nutch.apache.org by Olive g <ol...@hotmail.com> on 2006/04/03 21:14:38 UTC

thanks, but what I wanted to do is to merge segments from multiple crawls

For example, I want to crawl 20,000 pages every day for 10 days and then
merge the data for search. So far, I can't get it to work.
Any advice? Could someone let me know whether I can do this on 0.8 at all?

Thank you.


>From: "Gal Nitzan" <gn...@usa.net>
>Reply-To: <gn...@usa.net>
>To: <nu...@lucene.apache.org>
>Subject: RE: help please! - issues with merging indexes w/ DFS on 0.8
>Date: Mon, 3 Apr 2006 19:50:48 +0200
>
>Hi,
>
>I'm not sure what you are doing, so I will just describe what I'm doing and
>maybe you will find the answer :)
>
>Let's make some assumptions:
>1. main nutch dfs is: /user/nutchuser
>1.1 you should have /user/nutchuser/crawldb
>1.2 you should have /user/nutchuser/segments with some fetched segments in
>it
>2. now you need to create the linkdb
>2.1 bin/nutch invertlinks linkdb -dir segments
>3. now you need to index your segments
>3.1 bin/nutch index indexes crawldb linkdb segments/segment1
>segments/segment2 (add all of your segments here, one after another)
>4. now remove duplicates
>4.1 bin/nutch dedup indexes
>5. NOW merge
>5.1 bin/nutch merge index indexes (this command will create a new folder
>named /user/nutchuser/index which contains your new merged index).
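The five steps above can be collected into one script. This is only a sketch: the segment names (segment1, segment2) are placeholders for whatever fetched segments actually sit under /user/nutchuser/segments, and it assumes you run it from the Nutch home directory with the crawldb already populated.

```shell
#!/bin/sh
# Sketch of the index-build pipeline described above (Nutch 0.8).
# Segment names are placeholders -- substitute your own fetched segments.
set -e                                        # stop at the first failure

bin/nutch invertlinks linkdb -dir segments    # step 2: build the linkdb
bin/nutch index indexes crawldb linkdb \
    segments/segment1 segments/segment2       # step 3: index the segments
bin/nutch dedup indexes                       # step 4: remove duplicates
bin/nutch merge index indexes                 # step 5: merge into ./index
```

Running it again after adding a new day's segment means repeating steps 3-5 with the full segment list.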
>
>Hope it helps.
>
>Gal.
>
>
>-----Original Message-----
>From: Olive g [mailto:oliveg2005@hotmail.com]
>Sent: Monday, April 03, 2006 4:53 PM
>To: nutch-user@lucene.apache.org
>Subject: help please! - issues with merging indexes w/ DFS on 0.8
>
>Hi gurus,
>
>I ran into a similar issue to the one described at
>http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04073.html.
>I just could not get index merging to work. I've browsed the mailing list
>archives and tried many things, and nothing has worked so far. Does 0.8
>support merging indexes?
>
>I really appreciate any help.
>
>Thank you!
>
>Olive
>
>



Re: thanks, but what I wanted to do is to merge segments from multiple crawls

Posted by Andrzej Bialecki <ab...@getopt.org>.
Olive g wrote:
> For example, I want to crawl 20,000 pages every day for 10 days and
> then merge the data for search. So far, I can't get it to work.
> Any advice? Could someone let me know whether I can do this on 0.8 at
> all?

Not yet. This functionality hasn't been ported yet from 0.7. It's on my 
TODO list, but with a low priority.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com