Posted to user@nutch.apache.org by Dan Morrill <ra...@baker.edu> on 2006/04/02 16:49:44 UTC

RE: Multiple crawls how to get them to work together

Berlin, 

Sorry about the delay - I have dumped my entire experience on my blog
http://infosecandpolitics.blogspot.com including shell scripts, merging,
whole web crawls and the rest of the lot. The shell script was posted on
the blog on Thursday, and this morning I posted a wrap-up of getting
whole-web search to function in the Windows/Cygwin environment.

I am probably not using the software as intended, but Windows poses a number
of interesting issues that do not come up in a Linux environment (I am
also ramped up in a SuSE Linux environment).

Hope that helps. If anyone has questions about Nutch, Tomcat, Windows, or
Cygwin, I will happily help out.

Cheers/r/Dan



-----Original Message-----
From: Berlin Brown [mailto:berlin.brown@gmail.com] 
Sent: Thursday, March 30, 2006 2:09 PM
To: nutch-user@lucene.apache.org
Subject: Re: Multiple crawls how to get them to work together

Do you have that shell script?

On 3/30/06, Dan Morrill <ra...@baker.edu> wrote:
> Hi folks,
>
> It worked, it worked great, I made a shell script to do the work for me.
> Thank you, thank you, and again, thank you.
>
> r/d
>
> -----Original Message-----
> From: Dan Morrill [mailto:ralph.morrill@baker.edu]
> Sent: Thursday, March 30, 2006 5:12 AM
> To: nutch-user@lucene.apache.org
> Subject: RE: Multiple crawls how to get them to work together
>
> Aled,
>
> I'll try that today, excellent, and thanks for the heads-up on the db
> directory. I'll let you know how it goes.
>
> r/d
>
>
>
> -----Original Message-----
> From: Aled Jones [mailto:Aled.Jones@comtec-europe.co.uk]
> Sent: Thursday, March 30, 2006 12:24 AM
> To: nutch-user@lucene.apache.org
> Subject: RE: Multiple crawls how to get them to work together
>
> Hi Dan
>
> I'll presume you've done the crawls already.
>
> Each resulting crawl folder should have three subfolders: db, index, and
> segments.
>
> Create your search.dir folder and create a segments folder in that.
>
> Each segments folder in each crawl folder should contain folders with
> timestamps as the names.  Copy the contents of:
>
> crawlA/segments
> crawlB/segments
> crawlc/segments
>
> (i.e. the folders with timestamps as names) into:
>
> search.dir/segments
>
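
[Editor's note: as a toy shell sketch of the copy step above -- the crawl
names and timestamp folders here are made up for illustration:]

```shell
# Toy layout standing in for real crawl output; names are illustrative.
mkdir -p crawlA/segments/20060330120000 crawlB/segments/20060330130000
mkdir -p search.dir/segments

# Copy the timestamp-named folders (not the segments folder itself)
# into the combined search.dir/segments.
for crawl in crawlA crawlB; do
  cp -r "$crawl"/segments/* search.dir/segments/
done
```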
> Next, delete the duplicates from the segments by running the command:
>
> bin/nutch dedup -local search.dir/segments
>
> Then you need to merge the segments to create an index folder, so run
> the command:
>
> bin/nutch merge -local search.dir/index search.dir/segments/*
>
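
[Editor's note: the copy, dedup, and merge steps can be wrapped in one
script along the lines Dan describes later in the thread. This is only a
sketch: the crawl paths and the bin/nutch location are assumptions, and
with DRY_RUN=1 (the default here) it prints the commands instead of
running them so you can review them first:]

```shell
#!/bin/sh
# Sketch of the full merge pipeline; crawl paths and NUTCH are assumptions.
NUTCH=${NUTCH:-bin/nutch}
DRY_RUN=${DRY_RUN:-1}

# With DRY_RUN=1, echo each command instead of executing it.
run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

mkdir -p search.dir/segments
for crawl in crawlA crawlB crawlC; do
  run cp -r "$crawl/segments/." search.dir/segments/
done

run "$NUTCH" dedup -local search.dir/segments
run "$NUTCH" merge -local search.dir/index search.dir/segments/*
```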
> You should now have two folders in your search.dir:
> search.dir/segments
> search.dir/index
>
> That's all you need for serving pages (db folder is only used when
> fetching).
>
> Now just set the searcher.dir property value in nutch-site.xml to be the
> location of search.dir
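
[Editor's note: a hedged example of that property; the path is a
placeholder, and the property block goes in your nutch-site.xml:]

```xml
<!-- /opt/search.dir is a placeholder; point it at your search.dir. -->
<property>
  <name>searcher.dir</name>
  <value>/opt/search.dir</value>
</property>
```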
>
> That's how I've been doing it, although it may not be the "right" way.
> :-) Hope this helps.
>
> Cheers
> Aled
>
>
> > -----Original Message-----
> > From: Dan Morrill [mailto:ralph.morrill@baker.edu]
> > Sent: 29 March 2006 18:06
> > To: nutch-user@lucene.apache.org
> > Cc: rmorrill@gmail.com
> > Subject: Multiple crawls how to get them to work together
> >
> > Hi folks,
> >
> >
> >
> > I have 3 crawls, crawlA, crawlB, and crawlC. I would like all
> > of them to be available to the search.jsp page.
> >
> >
> >
> > I went through the site, saw merge, index, and make new db, and
> > followed all the directions I could find, but still no
> > resolution on this one. So what I need are some ideas on
> > where to proceed from here. I intend to have 2 or
> > 3 boxes each make a crawl, then somehow merge the crawls together
> > to form a "master" under search.dir. I would also want to
> > update this one on a regular basis.
> >
> >
> >
> > Unfortunately, the instructions to date have all been tried,
> > and have all led to the idea not working. There are also no
> > indexmerger or indexsegments directives in Nutch 0.7.1. Any
> > support ideas, direct pointers, or even step-by-step
> > instructions on how to do this would be appreciated (outside
> > of what is in the tutorials, because that has been tried
> > already, including support ideas in the user mailing list).
> >
> >
> >
> > Cheers/r/dan
> >
> >
> >
> >
> >
> >
> >
> >
> ###########################################
>
> This message has been scanned by F-Secure Anti-Virus for Microsoft
Exchange.
> For more information, connect to http://www.f-secure.com/
>
> ************************************************************************
> This e-mail and any attachments are strictly confidential and intended
> solely for the addressee. They may contain information which is covered by
> legal, professional or other privilege. If you are not the intended
> addressee, you must not copy the e-mail or the attachments, or use them
for
> any purpose or disclose their contents to any other person. To do so may
be
> unlawful. If you have received this transmission in error, please notify
us
> as soon as possible and delete the message and attachments from all places
> in your computer where they are stored.
>
> Although we have scanned this e-mail and any attachments for viruses, it
is
> your responsibility to ensure that they are actually virus free.
>
>
> =
>
>