Posted to user@nutch.apache.org by Vertical Search <ve...@gmail.com> on 2006/03/27 21:25:03 UTC

Merging indexes -- please help....

So far, I have been able to work through minor obstacles in setting up Nutch
for a vertical search.
Now I have been stuck for the past 24 hours trying to merge the indexes.
I have crawled multiple sites and want to merge the resulting indexes.
I am using the merge command as follows:

bin/nutch merge C:/vSearch/DB/index C:/global_search/index

All it says is "Adding C:/vSearch/DB/index", but the indexed data itself is
not consolidated and merged.

Can someone point me to the correct mail in the archive or help me get over
this problem?
Another question: in IndexMerger, I see -workingdir as a flag. I tried
that too, to no avail.

Please help.

Thanks

Re: Merging indexes -- please help....

Posted by Zaheed Haque <za...@gmail.com>.
First of all, I don't really understand your question. Sorry :-)
Anyway, I will make a guess. You used the "crawl" command
to create your index, correct? And you are asking
how to merge "two or more of such crawled data" into one, no?



--
Best Regards
Zaheed Haque
Phone : +46 735 000006
E.mail: zaheed.haque@gmail.com

Re: Merging indexes -- please help....

Posted by Berlin Brown <be...@gmail.com>.
Side comment. I asked this before: is it possible to merge two
databases? For example:

bin/nutch ....crawl   ...output to DB
bin/nutch ....crawl   ...output to NEW_DB_HERE

bin/nutch index DB
bin/nutch index NEW_DB_HERE

bin/nutch merge index indexes

Does this even make sense?


Re: Merging indexes -- please help....

Posted by Zaheed Haque <za...@gmail.com>.
You might want to try this, but I am not sure if it works :-) Please
make backups first!! This is a workaround.

I assume that you have two working indexes, i.e. "CrawlA" and "CrawlB"
(ready to go and working like a charm via the browser :-). I am also
taking for granted that all directories such as index, indexes, segments,
etc. are inside "CrawlA" and "CrawlB".

Now make a new directory called "CrawlC" next to them, with a
crawldb/current directory inside it:

mkdir -p CrawlC/crawldb/current

Now copy the crawldb parts (run this from the directory that contains
CrawlA, CrawlB and CrawlC):

cp -r CrawlA/crawldb/current/part-00000 CrawlC/crawldb/current/part-00000
cp -r CrawlB/crawldb/current/part-00000 CrawlC/crawldb/current/part-00001

NOTE the part-00001

Now make a directory segments under CrawlC:

mkdir CrawlC/segments

Now copy the segments:

cp -r CrawlA/segments/* CrawlC/segments/
cp -r CrawlB/segments/* CrawlC/segments/

etc.
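
The copy steps above can be sketched as a small shell function. This is only an illustration: combine_crawls is a hypothetical name, it assumes each source crawl has a single crawldb/current/part-00000 and at least one segment, and it only copies files without calling Nutch.

```shell
#!/bin/sh
# Sketch of the copy steps: gather crawldb parts and segments from several
# crawl directories into one new directory.
# combine_crawls is a hypothetical helper; the layout is assumed from the
# message (crawldb/current/part-00000 and segments/ in every source crawl).
combine_crawls() {
    target=$1
    shift
    mkdir -p "$target/crawldb/current" "$target/segments"
    n=0
    for crawl in "$@"; do
        # Give each copied part a distinct name: part-00000, part-00001, ...
        part=$(printf 'part-%05d' "$n")
        cp -r "$crawl/crawldb/current/part-00000" "$target/crawldb/current/$part"
        # Copy every segment directory of this crawl into the shared segments/.
        cp -r "$crawl"/segments/* "$target/segments/"
        n=$((n + 1))
    done
}
```

With this, "combine_crawls CrawlC CrawlA CrawlB" would perform the same copies as the commands listed above. As with the manual steps, make backups first.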

Now you should have two directories under CrawlC:

crawldb
segments

Proceed with

- bin/nutch invertlinks linkdb segments/*
- bin/nutch index indexes crawldb linkdb segments/*
- bin/nutch dedup indexes
- bin/nutch merge index indexes

Change your searcher.dir in nutch-site.xml and give it a go..
Cheers


Re: Merging indexes -- please help....

Posted by Andrzej Bialecki <ab...@getopt.org>.
Olive g wrote:
> We too have deadlines :(.
>
> I would appreciate it very much if someone can provide more insight.
> Is it a bug or a configuration issue? How can we even do incremental
> crawls on 0.8 with these issues?
>
> Should I send email to the developer mailing list? Would that help?
>
> Gurus, please help !!!!

Gurus have deadlines, too.

The answer is: don't use the 'nutch crawl' tool, use individual tools 
step by step, as it is described in the tutorial - they provide full 
incremental crawling/indexing support.

The Crawl tool is at the moment a very simplistic tool to do one-shot 
crawling to get you started, i.e. you cannot use it to update existing 
DBs/segments. And there is no tool to merge DBs yet.
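
As a rough illustration of "individual tools step by step", here is a dry-run sketch that only prints the commands for one crawl round. The sub-command names and argument order (generate, fetch, updatedb) are assumptions based on the tutorial of that era and are not spelled out in this message; print_crawl_round, the crawl directory and the segment name are hypothetical. Check the usage output of bin/nutch for your version before running anything.

```shell
#!/bin/sh
# Dry-run sketch of one incremental crawl round with the individual tools.
# ASSUMPTION: sub-command names and argument order follow the 0.8-era
# tutorial; nothing here is taken from the message itself.
NUTCH=${NUTCH:-bin/nutch}

print_crawl_round() {
    crawl_dir=$1   # hypothetical crawl directory holding crawldb/ and segments/
    segment=$2     # hypothetical path of the segment created by "generate"
    echo "$NUTCH generate $crawl_dir/crawldb $crawl_dir/segments"
    echo "$NUTCH fetch $segment"
    echo "$NUTCH updatedb $crawl_dir/crawldb $segment"
}
```

Repeating such a round adds new segments to an existing crawldb, which is the part the one-shot crawl tool does not cover.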

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Merging indexes -- please help....

Posted by Olive g <ol...@hotmail.com>.
We too have deadlines :(.

I would appreciate it very much if someone can provide more insight. Is it a
bug or a configuration issue? How can we even do incremental crawls on 0.8
with these issues?

Should I send email to the developer mailing list? Would that help?

Gurus, please help !!!!





Re: Merging indexes -- please help....

Posted by Vertical Search <ve...@gmail.com>.
Sorry. I too have faced the same problem. I am in the process of releasing
a demo (for management) over this weekend.
I will try to work on the merging after that. It is a very important
part and I have to get it to work if I am to succeed in adopting Nutch for
a vertical domain.
Furthermore, I could not get the PruneIndexTool up and running.
It asks for a query. I wonder if someone can share the query file, or the
format the tool expects.

It goes without saying that I am very thankful to the folks here for
extending their help.

Thanks




I had the same issue ... is this a bug or a configuration issue?

Posted by Olive g <ol...@hotmail.com>.
Hi,

I encountered the same problem on 0.8. See my post
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04103.html.
Does anyone have any idea? Is it a bug or a configuration issue?
Thanks.



RE: Merging indexes -- please help....

Posted by Olive g <ol...@hotmail.com>.
Hi,

I encountered the same problem on 0.8. See my post
http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04103.html.
Does anyone have any idea? Is it a bug or a configuration issue? Please let me
know.
Thanks.

Olive



RE: Merging indexes -- please help....

Posted by Dan Morrill <ra...@baker.edu>.
Hi,

I noticed that it didn't like the drive designation when I used it
(Windows Cygwin environment). If you did

./nutch merge -local /STG1/index /STG1/indexes

that may work better. Let me know.
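
The path change suggested above can be sketched as a tiny helper. strip_drive is a hypothetical name; it only rewrites the string and makes no claim about how Cygwin or Nutch then resolves the resulting path.

```shell
#!/bin/sh
# Sketch: drop a leading Windows drive designation ("E:") from a path,
# turning E:/STG1/index into /STG1/index.
# strip_drive is a hypothetical helper; it does not check that the path exists.
strip_drive() {
    # Remove one leading "<letter>:" if present; other paths pass through unchanged.
    printf '%s\n' "${1#[A-Za-z]:}"
}
```

For example, "strip_drive E:/STG1/indexes" prints the drive-less form that is used in the merge command above.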

Cheers/r/dan



Re: Merging indexes -- please help....

Posted by Vertical Search <ve...@gmail.com>.
Okay.
I had 2 sets of crawls,
such as E:/STG1 and E:/STG2.
I used the dedup command to remove duplicates.
Then the command I used to merge is as follows
(based on what is available in the mail archives and the responses I got).

First I ran

bin/nutch merge E:/STG1/index E:/STG1/indexes
bin/nutch merge E:/STG1/index E:/STG2/indexes

In nutch-site.xml I have searcher.dir as E:/STG1

I get absolutely no results. The command console is as follows.
Can someone shed some light on this, please, ASAP?

INFO: creating new bean
Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
INFO: opening merged index in E:\Hoodukoo\STG5\index
Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
INFO: opening segments in E:\Hoodukoo\STG5\segments
Apr 2, 2006 8:58:36 PM org.apache.hadoop.conf.Configuration getConfResourceAsReader
INFO: found resource common-terms.utf8 at file:/C:/xampp/tomcat/webapps/hoodukoo/WEB-INF/classes/common-terms.utf8
Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean init
INFO: opening linkdb in E:\Hoodukoo\STG5\linkdb
Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
INFO: query request from 127.0.0.1
Apr 2, 2006 8:58:36 PM org.apache.jsp.search_jsp _jspService
INFO: query: site
Apr 2, 2006 8:58:36 PM org.apache.nutch.searcher.NutchBean search
INFO: searching for 20 raw hits
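
Both merge calls above write to the same output index. One thing that might be worth trying, and this is an assumption rather than something confirmed in this thread, is a single merge call that lists both "indexes" directories after the output directory. The dry-run sketch below only prints such a command; print_merge and the example paths are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: build ONE merge command that names the output index first
# and then several per-crawl "indexes" directories.
# ASSUMPTION: the merge tool accepts more than one input directory.
NUTCH=${NUTCH:-bin/nutch}

print_merge() {
    output=$1
    shift
    cmd="$NUTCH merge $output"
    for indexes_dir in "$@"; do
        cmd="$cmd $indexes_dir"
    done
    echo "$cmd"
}
```

Calling "print_merge /merged/index /STG1/indexes /STG2/indexes" prints a single command line; drop the echo only after checking the merge usage text for your Nutch version.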

Re: Merging indexes -- please help....

Posted by Vertical Search <ve...@gmail.com>.
Thank you, Zaheed. I am using 0.8-dev on Windows with Cygwin.
Let me try this.

Thanks



Re: Merging indexes -- please help....

Posted by Zaheed Haque <za...@gmail.com>.
I am guessing you have used 0.8-dev. I am not sure how things are in 0.7.

bin/nutch index <index> <crawldb> <linkdb> <segment>

where
<index> = name of your index directory! NOTE you cannot call it
"index" (please correct me if I am wrong). This is why it is good to
call it "indexes" or "id".
<crawldb> = name of your crawldb directory
<linkdb> = name of your linkdb directory
<segment> = name of your segment directory! Not the "segments" directory,
but something like "segments/200612120937".

Now you need to index all your segments using the command above. The
<index> folder should stay the same, but the segment folder will
change, of course.
Example:

bin/nutch index indexes crawldb linkdb segment/200612029232
bin/nutch index indexes crawldb linkdb segment/200612022212

Once the indexing is done for all the fetched segments, try the following:

bin/nutch merge index indexes

Note that "index" above is where the merged index is placed for searching
(i.e. the output index), and the "indexes" folder is where your per-segment
indexes are.
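
The steps above can be sketched as a dry-run loop that prints one index command per segment and then the merge. This follows the message as written and is not a tested recipe: print_index_and_merge is a hypothetical name, and the crawl layout (crawldb, linkdb, segments/, indexes, index under one directory) is an assumption for illustration.

```shell
#!/bin/sh
# Dry-run sketch: print one "index" command per segment, then the merge.
# NUTCH and the crawl layout are assumptions; the function only echoes
# commands and never calls Nutch itself.
NUTCH=${NUTCH:-bin/nutch}

print_index_and_merge() {
    crawl_dir=$1
    for segment in "$crawl_dir"/segments/*; do
        # Skip the unexpanded glob when there are no segment directories.
        [ -d "$segment" ] || continue
        echo "$NUTCH index $crawl_dir/indexes $crawl_dir/crawldb $crawl_dir/linkdb $segment"
    done
    # Output index first, then the directory holding the per-segment indexes.
    echo "$NUTCH merge $crawl_dir/index $crawl_dir/indexes"
}
```

Reviewing the printed commands first makes it easy to see that the merge arguments are in the order described above: output index, then the "indexes" folder.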

Hope this helps. I don't know how things are on Windows with Cygwin.

